Abstract


We consider the setting of linear regression in high dimension. We focus on the problem of constructing adaptive and honest confidence sets for the sparse parameter $\theta$, i.e. we want to construct a confidence set for $\theta$ that contains $\theta$ with high probability, and that is as small as possible. The $\ell_2$ diameter of such a confidence set should depend on the sparsity $S$ of $\theta$: the larger $S$, the wider the confidence set. However, in practice, $S$ is unknown. This paper focuses on constructing a confidence set for $\theta$ which contains $\theta$ with high probability, whose diameter is adaptive to the unknown sparsity $S$, and which is implementable in practice.


1 Introduction 


We consider the regression model in dimension $p$, with $n$ observations,

$$Y = X\theta + \varepsilon, \qquad (1.1)$$

where $Y$ is the $n$-dimensional observation vector, $\theta$ is the $p$-dimensional unknown parameter, $X$ is the $n \times p$ design matrix, and $\varepsilon$ is the $n$-dimensional white noise (see Section 3 for a complete presentation of the setting). We focus on the high dimensional setting where $n \ll p$.
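To make the setting concrete, the following minimal Python sketch simulates one draw from model (1.1) in the regime $n \ll p$. It is an illustration only: the Gaussian design, the Gaussian non-zero coefficients and the helper name `simulate_sparse_regression` are assumptions made for the example, not choices prescribed at this point of the paper.

```python
import numpy as np

def simulate_sparse_regression(n, p, S, sigma=1.0, seed=0):
    """Draw (X, Y, theta) from Y = X theta + eps with an S-sparse theta.

    Assumptions of this sketch: i.i.d. standard Gaussian design entries,
    standard Gaussian non-zero coefficients, Gaussian noise of std sigma.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))                  # n x p design matrix
    theta = np.zeros(p)
    support = rng.choice(p, size=S, replace=False)   # S distinct coordinates
    theta[support] = rng.standard_normal(S)
    eps = sigma * rng.standard_normal(n)             # white noise
    Y = X @ theta + eps
    return X, Y, theta

# Example: n = 50 observations in dimension p = 200, sparsity S = 5.
X, Y, theta = simulate_sparse_regression(n=50, p=200, S=5)
```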


Models such as the one described in Equation (1.1) have received much attention recently. In particular, finding good estimates of $\theta$ when $p$ is very large has many important applications (see [Starck et al.(2010)], [Moriaka and Satoh(2010)], [Kavukcuoglu et al.(2009)]).


Solving this problem in a satisfying way is nevertheless impossible in general, since it is ill-posed. For this reason, and also because it is an assumption that holds in many concrete cases, it is usual in this setting to focus on the case where $\theta$ is a sparse parameter. Let $\ell_0(S)$ be the $\ell_0$ "ball" of radius $S$, i.e. the set of vectors that have fewer than $S$ non-zero coordinates (that are $S$-sparse). It has been proved that in some specific cases, namely when the design matrix $X$ satisfies some desirable conditions for $\mathfrak{p}$-sparse vectors (e.g. null space property, R.I.P., restricted eigenvalue condition, etc., see e.g. [Candes et al.(2006)], [Koltchinskii(2009)], [Donoho and Stark(1989)], [Candes(2008)], [Foucart and Lai(2009)], [Bickel et al.(2009)]), it is possible to construct an estimate $\hat\theta(X,Y)$ of the parameter $\theta$ such that

$$\sup_{S:\, S \le \mathfrak{p}}\ \sup_{\theta \in \ell_0(S)} \mathbb{P}_\theta\Big( \|\hat\theta - \theta\|_2^2 \ge E\, \frac{S \log(p/\delta)}{n} \Big) \le \delta, \qquad (1.2)$$

where $E > 0$ is a constant, where for any vector $u$, $\|u\|_2 = \sqrt{\sum_i u_i^2}$ is the usual $\ell_2$ norm, and where $\mathbb{P}_\theta$ is the probability according to the noise $\varepsilon$ when the true parameter is $\theta$. This result is minimax-optimal over $S$-sparse vectors for any $S \le \mathfrak{p}$, see e.g. [Raskutti et al.(2011)]. Moreover, this bound on the accuracy of $\hat\theta$ scales with the true sparsity $S$ of $\theta$, which is not available to the learner: the estimate $\hat\theta$ is adaptive. Some key references for the existence of such an adaptive estimate are [Zou(2006)], [Candes and Tao(2007)], [Bickel et al.(2009)], [Bühlmann and van de Geer(2011)].


Although this problem is a difficult combinatorial problem, there exist computationally feasible techniques under stronger assumptions on the design $X$, for instance thresholding procedures, orthogonal matching pursuit, the Lasso, the Dantzig Selector, etc. For more references on these techniques and the associated bounds and design assumptions, see e.g. [Donoho and Stark(1989)], [Tibshirani(1994)], [Donoho(2006)], [Candes et al.(2006)], [Lee et al.(2013)].


Another important problem is that of confidence statements for the parameter $\theta$, i.e. of quantifying the precision of an estimate of $\theta$. Constructing confidence sets in high dimensional regression was studied e.g. in the papers [Abbasi-Yadkori et al.(2012)], [Javanmard and Montanari(2013)], [Beran and Dümbgen(1998)], [Baraud(2004)], [Nickl and van de Geer(2013)], and it is, with the problem of estimating $\theta$, the second fundamental problem of inference in this setting. The objective is to construct a set $C_n$ that contains $\theta$ with high probability and that is as small as possible, i.e. whose $\ell_2$ diameter is as small as possible. One can deduce from the lower and upper bounds for the estimation problem (see [Raskutti et al.(2011)], [Nickl and van de Geer(2013)], [Javanmard and Montanari(2013)]) that the optimal $\ell_2$-width of a confidence interval for the sparse vector $\theta$ should depend on its sparsity $S$: it should be of order $\sqrt{S\log(p)/n}$. For $\delta > 0$, if $\theta$ is $S$-sparse and $S$ is known, and if $\hat\theta$ is an estimator of $\theta$ that satisfies Equation (1.2), a $\delta$-confidence interval $C_n$ of coverage $1 - \delta$ should ideally be of the form

$$C_n = \Big\{ u \in \mathbb{R}^p : \|u - \hat\theta\|_2^2 \le E\, \frac{S\log(p/\delta)}{n} \Big\}.$$

On the other hand, if the sparsity $S$ of the parameter $\theta$ is unknown, which is the case in real applications since $\theta$ is unknown, one cannot directly construct this optimally sized confidence interval.
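For illustration, the oracle set above (known sparsity $S$) can be written as a short membership test. This is a sketch under stated assumptions: the constant $E$ is only known to exist, so it is passed in by the caller (the default `E=1.0` is a placeholder), and the function names are hypothetical.

```python
import math
import numpy as np

def oracle_radius_sq(S, p, n, delta, E=1.0):
    """Squared radius E * S * log(p / delta) / n of the oracle set."""
    return E * S * math.log(p / delta) / n

def in_oracle_set(u, theta_hat, S, n, delta, E=1.0):
    """Check whether u lies in {u : ||u - theta_hat||_2^2 <= squared radius}."""
    u = np.asarray(u, dtype=float)
    theta_hat = np.asarray(theta_hat, dtype=float)
    p = theta_hat.shape[0]
    sq_dist = float(np.sum((u - theta_hat) ** 2))
    return sq_dist <= oracle_radius_sq(S, p, n, delta, E)
```

The radius grows with $S$, which is exactly why an unknown sparsity prevents one from using this set directly.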

For the problem of estimating $\theta$, computationally feasible techniques that are adaptive to the unknown sparsity $S$ of $\theta$ exist (see Equation (1.2) and associated references). Do similar results hold for the problem of constructing a confidence set for $\theta$? In particular, can one construct, in a computationally inexpensive way, a confidence set around the adaptive estimate of Equation (1.2) that contains $\theta$ with high probability and whose diameter is adaptive to the unknown sparsity $S$ of $\theta$? This is the problem that this paper targets. We would like to insist on the importance of the construction of confidence sets. Indeed, most sequential learning algorithms rely on such confidence sets. For instance, in the papers [Abbasi-Yadkori et al.(2012)], [Carpentier and Munos(2012)], [Desphandes and Montanari(2012)], which are on the topic of linear bandits in high dimension, the authors assume that an upper bound $\bar S$ on the sparsity $S$ is known, and they consider a large confidence interval for $\theta$ whose diameter depends on $\bar S$. The final bounds on the regret of their bandit algorithms then depend on the chosen upper bound $\bar S$, and not on the correct sparsity $S$ of $\theta$. In this setting, it would be quite useful to have an adaptive and honest confidence set for $\theta$ that adapts to the sparsity $S$. The bound on the regret would then depend on $S$ and not on $\bar S$.


2 Literature review 


The problem of constructing a confidence set when the support of the parameter or its sparsity $S$ is known has received attention recently, see e.g. [Abbasi-Yadkori et al.(2012)], [Javanmard and Montanari(2013)], [Javanmard and Montanari(2014)], [Beran and Dümbgen(1998)], [Baraud(2004)], [Lee et al.(2014)], [van de Geer and Bühlmann(2014)]. The papers [Javanmard and Montanari(2013)], [Javanmard and Montanari(2014)], [Lee et al.(2014)], [van de Geer and Bühlmann(2014)] are concerned with finding the limiting distribution of an estimate of $\theta$, and using it to build tests and confidence sets for fixed, low dimensional sub-models (fixed subsets of coordinates of $\theta$). This approach does not provide an optimally sized confidence set for the parameter $\theta$ itself, since its support (i.e. the correct low dimensional model of interest) is not known. On the other hand, the problem of constructing an adaptive and honest confidence interval for $\theta$ when $S$ is unknown has only recently started to receive attention. It is an important problem, since there is no reason why the low dimensional support of the parameter should be known beforehand: the low dimensional model of interest therefore cannot be chosen efficiently in a non data-driven way. This problem is related to the problem of estimating the sparsity $S$ of the parameter $\theta$, as explained in various related settings in the papers [Hoffmann and Lepski(2002)], [Juditsky and Lambert-Lacroix(2003)], [Gine and Nickl(2010)], [Nickl and van de Geer(2013)]. Indeed, if a good estimate $\hat S$ of $S$ is available, then one could consider the confidence interval

$$C_n = \Big\{ u \in \mathbb{R}^p : \|u - \hat\theta\|_2^2 \le E\, \frac{\hat S\log(p/\delta)}{n} \Big\}.$$


Let us first consider a simpler instance of this problem that highlights its difficulties. It is the case where one wants to be adaptive to two sparsities $S_0 < S_1$ (and not to any sparsity $S$). The objective is to construct a confidence set $C_n$ that is adaptive and honest, i.e. that contains $\theta$ with high probability, and whose diameter is of order $\sqrt{S_0\log(p)/n}$ if $\theta$ is of sparsity $S_0$ or below, or of order $\sqrt{S_1\log(p)/n}$ if the sparsity of $\theta$ is between $S_0$ and $S_1$. In other words, the objective is to construct a set $C_n$ based on the data such that for $\mathcal{P} = \ell_0(S_1)$, for $\mathcal{I} = \{S_0, S_1\}$, and for $\delta > 0$,

$$\begin{aligned}
&\max_{S \in \mathcal{I}}\ \sup_{\theta \in \ell_0(S) \cap \mathcal{P}} \mathbb{P}_\theta(\theta \in C_n) \ge 1 - \delta, \quad \text{and} \\
&\max_{S \in \mathcal{I}}\ \sup_{\theta \in \ell_0(S) \cap \mathcal{P}} \mathbb{P}_\theta\Big(|C_n|_2 \ge \sqrt{E'\, \frac{S\log(p/\delta)}{n}}\Big) \le \delta,
\end{aligned} \qquad (2.1)$$

where $E' > 0$ is a constant and where $|C_n|_2 = \sup\{\|u - v\|_2,\ (u,v) \in C_n^2\}$ is the diameter of $C_n$. There is however a serious obstruction to the creation of such a confidence set. It is possible to prove (see e.g. [Baraud(2004)], [Nickl and van de Geer(2013)]) that in many situations, there exist no adaptive and honest confidence sets on the entire parameter space $\mathcal{P} = \ell_0(S_1)$. The problem is that there are some parameters that are not $S_0$-sparse, but that are very close to $S_0$-sparse vectors, and for which it is impossible to detect that one needs a confidence set of diameter $\sqrt{E' S_1\log(p/\delta)/n}$ (since a confidence set of diameter $\sqrt{E' S_0\log(p/\delta)/n}$ won't provide enough coverage). A reasonable and important question is then to provide a confidence set that is adaptive and honest on the largest possible model $\mathcal{P} \subset \ell_0(S_1)$. Intuitively, this model $\mathcal{P}$ should be the set $\ell_0(S_1)$ from which the problematic parameters that are not $S_0$-sparse, but are very close to $\ell_0(S_0)$, have been removed.

There have been some very important recent advances on this problem in the paper [Nickl and van de Geer(2013)]. They define the separated set $\tilde\ell_0(S_1, \rho)$ for a constant $\rho > 0$ as

$$\tilde\ell_0(S_1, \rho) = \{u \in \ell_0(S_1) : \|u - \ell_0(S_0)\|_2 \ge \rho\}, \qquad (2.2)$$

where, if $A \subset \mathbb{R}^p$, we write for $u \in \mathbb{R}^p$, $\|u - A\|_2 = \inf_{v \in A}\|u - v\|_2$. They then define

$$\mathcal{P} = \mathcal{P}(\rho) := \ell_0(S_0) \cup \tilde\ell_0(S_1, \rho).$$

This new model excludes vectors that are not $S_0$-sparse but lie at a distance less than $\rho$ from $S_0$-sparse vectors. The smaller $\rho$, the more similar $\mathcal{P}$ is to $\ell_0(S_1)$ (equality holds when $\rho = 0$). The restriction to the model $\mathcal{P}$ can be seen as a margin condition: the $\rho$-margin condition is satisfied if the true parameter $\theta$ belongs to a sub-model where the two classes of sparsity are $\rho$-away from each other, i.e. if $\theta$ belongs either to $\ell_0(S_0)$ or to $\tilde\ell_0(S_1, \rho)$. This margin condition is necessary for being able to distinguish between the two sets of sparsity $S_0$ and $S_1$.

The objective is then to characterize the smallest possible $\rho$ for which such a confidence set exists, and then to construct this confidence set. Table 1 summarizes the minimax-optimal order of $\rho := \rho_n$ (with lower and upper bounds) such that an adaptive and honest confidence set for $\theta$ exists in $\mathcal{P}(\rho_n)$, as a function of $S_0$, $S_1$ (see [Nickl and van de Geer(2013)]). All upper bounds and lower bounds match in the three cases summarized in the table, and $\lim_{n\to\infty}\rho_n = 0$ at a rate which depends on $S_0$, $S_1$, which implies that $\mathcal{P}(\rho_n)$ converges to $\ell_0(S_1)$. The minimax-optimal rates at which $\rho_n$ can go to 0 are known for this problem, but an important issue remains: the existence of computationally feasible adaptive and honest confidence intervals in cases (ii) and (iii) of Table 1. Indeed, the procedure in [Nickl and van de Geer(2013)] consists in computing a quantity of the form $\inf_{u \in \ell_0(S_0)}|t_n(u)|$, where $t_n(u)$ is some quadratic functional of the data and $u$, then testing if this quantity exceeds a threshold, and finally using the output of this test for constructing the confidence interval. Computing this statistic is computationally expensive, since its computational complexity is of order $p^{S_0} n$. It is thus not suitable for concrete applications. Also, they assume that their design matrix is a random Gaussian matrix, which is quite restrictive. Another aspect that moreover prevents the use of this method in practical applications is the fact that sparsity is unlikely to hold exactly in practice, and that results involving approximate sparsity are more appealing.

In this paper, we provide a computationally feasible method for constructing an honest and adaptive confidence set on a maximal model $\mathcal{P}$ (in the minimax-optimal sense of Table 1), in a more general setting, i.e. in the setting of approximate sparsity for the parameter $\theta$, and under more general assumptions for the design matrix $X$, namely that its condition number is not too large for $S_1$-sparse vectors. The confidence set we propose is actually trivial to implement and its computational complexity is of order $O(pn)$ (provided that an adaptive estimate satisfying Equation (1.2) is available). We first provide a method in the case of two sparsity indexes (that achieves adaptivity to two sparsities $S_0$, $S_1$), and we then extend it to the more general setting of many sparsity indexes $\mathcal{I}$. In particular, the method we propose applies to constructing a confidence set that is adaptive and honest for all the sparsities smaller than a large sparsity index $\mathfrak{p}$. It is minimax optimal also in this setting, adaptive, and implementable in practice. We test our method with numerical simulations, and all proofs are in the supplementary material.

3 Setting 

Let $n$ and $p$ be two integers with $n \ll p$. Consider the model defined in Equation (1.1).

3.1 Assumption on the noise $\varepsilon$

We state the following assumption about the noise $\varepsilon$, namely that its entries are independent and sub-Gaussian.

Assumption 3.1. The entries of $\varepsilon$ are independent. Moreover, $\forall i \le n$, $\mathbb{E}\varepsilon_i = 0$, $\mathbb{V}\varepsilon_i = \sigma^2$, and there exists $c > 0$ such that $\forall i \le n$, $\forall \lambda \in \mathbb{R}$,

$$\mathbb{E}\exp(\lambda\varepsilon_i) \le \exp(\lambda^2 c^2/2).$$

For instance, bounded random variables and Gaussian random variables satisfy this assumption.
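As a quick check of the last sentence, the two standard computations below show how the sub-Gaussian constant $c$ arises in the Gaussian case and in the simplest bounded case (symmetric signs). They are textbook facts added for illustration, not part of the original argument.

```latex
% Gaussian noise: \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
\mathbb{E}\exp(\lambda \varepsilon_i) = \exp\!\left(\lambda^2 \sigma^2 / 2\right)
\quad\Longrightarrow\quad c = \sigma .

% Symmetric bounded noise: \varepsilon_i = \pm 1 with probability 1/2 each
\mathbb{E}\exp(\lambda \varepsilon_i) = \cosh(\lambda)
= \sum_{k \ge 0} \frac{\lambda^{2k}}{(2k)!}
\le \sum_{k \ge 0} \frac{(\lambda^2/2)^k}{k!}
= \exp\!\left(\lambda^2 / 2\right)
\quad\Longrightarrow\quad c = 1 .
```

The inequality in the second display uses $(2k)! \ge 2^k\, k!$ term by term.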

3.2 Assumption on the design matrix X 

We make the following assumption about the design matrix $X$.

Assumption 3.2. Let $\mathfrak{p} > 0$. The matrix $X$ is such that there exist two constants $C_M > c_m > 0$ such that for any $u$ that is $\mathfrak{p}$-sparse, we have

$$c_m\|u\|_2 \le \Big\|\frac{1}{\sqrt{n}} X u\Big\|_2 \le C_M\|u\|_2.$$

This assumption is a relaxation of the celebrated R.I.P. condition (see [Foucart and Lai(2009)] for another paper in this setting). This condition makes sense, when $n \ll p$, only if $\mathfrak{p}$ is actually smaller than $n$, i.e. $\mathfrak{p} = O(n/\log(p))$. In this case, for instance, random Fourier matrices, and more generally RIP matrices, satisfy this condition, with $c_m$ and $C_M$ close to 1. More generally, e.g. random matrices with sub-Gaussian entries whose variance-covariance matrix has bounded condition number satisfy this condition for $\mathfrak{p} = O(n/\log(p))$ and $c_m$ and $C_M$ depending on the condition number of the variance-covariance matrix.

                          Case (i)                       Case (ii)                      Case (iii)
Values of $S_0$, $S_1$    $S_0 = o(n^{1/2}/\log(p))$     $S_0 = o(n^{1/2}/\log(p))$     $S_0 = o(n/\log(p))$
                          $S_1 = o(n^{1/2}/\log(p))$     $S_1 = o(n/\log(p))$           $S_1 = o(n/\log(p))$
UB on $\rho$              $\sqrt{S_1\log(p)/n}$          $n^{-1/4}$                     $0$
LB on $\rho$              $\sqrt{S_1\log(p)/n}$          $n^{-1/4}$                     $0$
Computational complexity  $np$                           $np^{S_0}$                     $np^{S_0}$

Table 1: Upper and lower bounds on the parameter $\rho$ for the problem of constructing an honest and adaptive confidence set. UB stands for Upper Bound, LB stands for Lower Bound.
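Assumption 3.2 quantifies over all sparse vectors, so it cannot be certified by sampling, but a rough empirical look at the ratio $\|Xu\|_2/(\sqrt{n}\|u\|_2)$ is easy to obtain. The sketch below is only a sanity check under stated assumptions: it samples random sparse directions with Gaussian coefficients, so the returned extremes are optimistic (the sampled minimum can only overestimate $c_m$ and the sampled maximum can only underestimate $C_M$). The helper name is hypothetical.

```python
import numpy as np

def sampled_restricted_extremes(X, s, trials=200, seed=0):
    """Min and max of ||X u||_2 / sqrt(n) over random s-sparse unit vectors u."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    ratios = np.empty(trials)
    for t in range(trials):
        support = rng.choice(p, size=s, replace=False)  # random support of size s
        u = np.zeros(p)
        u[support] = rng.standard_normal(s)
        u /= np.linalg.norm(u)                          # normalise so ||u||_2 = 1
        ratios[t] = np.linalg.norm(X @ u) / np.sqrt(n)
    return float(ratios.min()), float(ratios.max())
```

For a design with i.i.d. standard Gaussian entries and $s$ small compared with $n/\log(p)$, both values would be expected to stay close to 1, in line with the remark above.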

3.3 The set of vectors of approximate sparsity 

We are interested in situations where $\theta$ is approximately $S$-sparse. More specifically, we focus on vectors $\theta$ that have fewer than $S$ "large" components, but that can have up to $\mathfrak{p}$ "small" components such that their $\ell_2$ norm is not too large.

Definition 3.1. We define the following sets of approximately $S$-sparse vectors, for $B, C, \mathfrak{p}, \delta > 0$, as

$$\mathcal{S}_S(C, B, \mathfrak{p}, \delta) = \Big\{ u \in \ell_0(\mathfrak{p}),\ \|u\|_2 \le B : \|u - \ell_0(S)\|_2^2 \le C\, \frac{S\log(p/\delta)}{n} \Big\},$$

where for a vector $u$, $\|u - \ell_0(S)\|_2^2 = \inf_{v \in \ell_0(S)}\|u - v\|_2^2 = \sum_{j=S+1}^{p} u_{(j)}^2$, where $u_{(\cdot)}$ is the ordered version of $|u|$, i.e. is such that $|u_{(1)}| \ge |u_{(2)}| \ge \ldots \ge |u_{(p)}|$.

The vectors in these sets have at most $S$ "large" components, and the $\mathfrak{p} - S$ remaining components have small $\ell_2$ norm. An important property of $\mathcal{S}_S(C, B, \mathfrak{p}, \delta)$ is that it contains all $S$-sparse vectors whose $\ell_2$ norm is bounded by $B$, and is an enlargement of the set of sparse vectors to "approximately" sparse vectors.
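The squared distance to the set of $S$-sparse vectors that appears in Definition 3.1 is cheap to compute: sort the squared entries and sum everything beyond the $S$ largest. The sketch below illustrates this. The argument `tail_bound` stands for the right-hand side of the definition and is supplied by the caller; the function names are hypothetical, and the check on the number of non-zero coordinates is omitted for brevity.

```python
import numpy as np

def sq_dist_to_sparse(u, S):
    """Squared l2 distance from u to the set of S-sparse vectors."""
    sq = np.sort(np.asarray(u, dtype=float) ** 2)[::-1]  # squared entries, decreasing
    return float(sq[S:].sum())                           # mass outside the S largest

def is_approx_sparse(u, S, tail_bound, norm_bound=np.inf):
    """Check ||u||_2 <= norm_bound and ||u - l0(S)||_2^2 <= tail_bound."""
    u = np.asarray(u, dtype=float)
    small_norm = np.linalg.norm(u) <= norm_bound
    small_tail = sq_dist_to_sparse(u, S) <= tail_bound
    return bool(small_norm and small_tail)
```

An exactly $S$-sparse vector has distance zero, so it passes the tail check for any non-negative `tail_bound`, which mirrors the remark that the enlarged sets contain all bounded $S$-sparse vectors.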

Let $0 < S_0 < S_1$ be two sparsities. Similarly to what is proposed in Equation (2.2), we define the separated set as

$$\tilde{\mathcal{S}}_{S_1}(C, B, \mathfrak{p}, \delta, \rho) := \tilde{\mathcal{S}}_{S_1, S_0}(C, B, \mathfrak{p}, \delta, \rho) = \{u \in \mathcal{S}_{S_1}(C, B, \mathfrak{p}, \delta) : \|u - \ell_0(S_0)\|_2 \ge \rho\}. \qquad (3.1)$$

These sets are such that, between $\mathcal{S}_{S_0}$ and $\tilde{\mathcal{S}}_{S_1}$, there is a $\rho$-margin condition.

For the same value of $\rho$, these sets are strictly larger than the sets presented in Equation (2.2) with bounded $\ell_2$ norm, which are actually the sets considered in the paper [Nickl and van de Geer(2013)]. Indeed, they correspond to the vectors in the enlarged sets $\mathcal{S}_{S_1}(C, B, \mathfrak{p}, \delta)$ that are at least $\rho$-away from $\ell_0(S_0)$.

3.4 Adaptive and honest confidence sets

We now provide the definition of adaptive and honest confidence sets in the wider model of approximately sparse vectors. It is an extension of the definition provided in Equation (2.1) to the larger set of approximately sparse vectors (it demands that the second equation in Definition 3.2 holds also for approximately sparse vectors).

Definition 3.2. Let $\delta, C, B > 0$. A set $C_n$ is a $\delta$-adaptive and honest confidence set for $\mathcal{P} \subset \mathcal{S}_{S_1}(C, B, \mathfrak{p}, \delta)$ and for $\mathcal{I} \subset \{1, \ldots, \mathfrak{p}\}$ if

$$\begin{aligned}
&\max_{S \in \mathcal{I}}\ \sup_{\theta \in \mathcal{S}_S(C, B, \mathfrak{p}, \delta) \cap \mathcal{P}} \mathbb{P}_\theta(\theta \in C_n) \ge 1 - \delta, \quad \text{and} \\
&\max_{S \in \mathcal{I}}\ \sup_{\theta \in \mathcal{S}_S(C, B, \mathfrak{p}, \delta) \cap \mathcal{P}} \mathbb{P}_\theta\Big(|C_n|_2 \ge \sqrt{E'\, \frac{S\log(p/\delta)}{n}}\Big) \le \delta,
\end{aligned}$$

where $E' > 0$ is a constant.

4 Adaptive estimation of $\theta$ on the enlarged sets

We are first going to prove that on these enlarged sets $\mathcal{S}_S(C, B, \mathfrak{p}, \delta)$, adaptive inference remains possible, i.e. that it is possible to build an estimate of $\theta$ that satisfies results similar to what is described in Equation (1.2). More precisely, we prove that if the design is not too correlated ($c_m$ and $C_M$ not too far from 1 in Assumption 3.2), then the lasso estimator will provide good results on the enlarged sets.

Theorem 4.1 (Adaptive Lasso on the enlarged sets). Let Assumptions 3.1 and 3.2 be satisfied for $c, c_m, C_M, \mathfrak{p}$ such that $c > 0$, $c_m > 2/3$, $C_M < 4/3$ and $\mathfrak{p} > 0$. Let $B > 0$ and $C > 0$. Let $\delta > 0$. The solution $\hat\theta$ of $\ell_1$ minimization or lasso

$$\hat\theta = \arg\min_u\ \|Y - Xu\|_2^2 + \kappa\sqrt{\log(p/\delta)\, n}\ \|u\|_1,$$

where $\kappa > 4\max(c, \sqrt{C}/3, c^2, C/9)$, is such that we have $\forall\, 0 < S \le \mathfrak{p}$,

$$\sup_{\theta \in \mathcal{S}_S(C, B, \mathfrak{p}, \delta)} \mathbb{P}_\theta\Big( \|\hat\theta - \theta\|_2^2 \ge \big(12(36\kappa + 36)^2 + C^2\big) \frac{S\log(p/\delta)}{n} \Big) \le \delta.$$


The proof of this theorem is in the Supplementary Material (see Appendix B). Proving this bound for the lasso on the enlarged sets is actually very similar to proving it on the set of exactly sparse vectors. An important remark is that the lasso's computational complexity is not high and is well understood, see [Tibshirani(1994)]. As usual, the lasso does not work on too correlated designs; it can be applied when $c_m > 2/3$ and $C_M < 4/3$. When this is not satisfied, other techniques have to be considered, see e.g. [Foucart and Lai(2009)]. It is actually possible to prove that for any $0 < c_m < C_M$, there exists an estimate that satisfies a result similar to the one in Theorem 4.1, see Theorem D.1 in the Supplementary Material, Appendix D. This estimate is however the result of $\ell_0$ minimization, and is thus computationally expensive (the computational complexity is $np^{S_0}$), and is in practice not implementable whenever $p$, $S_0$ are too large.

The really nice feature of such a result is that it provides an estimate whose $\ell_2$-risk is adaptive uniformly to the sparsity of any vector of the enlarged sparsity class, for any sparsity smaller than $\mathfrak{p}$. The estimate is data driven, but it needs an upper bound $C$ on the amount by which $\theta$ deviates from the sparsity $S$, and it also needs an upper bound $c$ on the parameter that bounds the sub-Gaussian tail of the noise.

5 Adaptive and honest confidence sets for $\theta$

We now propose a computationally feasible method for constructing an adaptive and honest confidence set for $\theta$. We first present this method in the case of adaptivity to only two sparsities $S_0 < S_1$ (two sparsity indexes method), and then explain how to extend these results to larger sets of sparsities (multi sparsity indexes method).

5.1 Presentation of the confidence set for two sparsity indexes $S_0 < S_1$

Construction of the confidence set Let $S_0 < S_1 \le \mathfrak{p}$.

The algorithm for the two sparsity indexes, Algorithm 1, contains two main steps. The first step is to construct a test $\Psi_n$ for deciding whether $\theta$ is $S_0$ approximately sparse, or $S_1$ approximately sparse. The test consists in first computing an adaptive estimate $\hat\theta$ of $\theta$, and then thresholding all non-significant components. Then, the testing decision is based on two factors: (i) testing whether the number of non-zero entries of the thresholded estimate is larger than $S_0$, and (ii) testing whether the squared residuals $\|r\|_2^2$ are significant or not. If both these quantities are small enough, the test is accepted, otherwise it is rejected. The outcome of this is the test $\Psi_n$. The second step is to use this information to construct the confidence set $C_n$. Based on $\Psi_n$, the confidence interval $C_n$ will be of diameter of order $\sqrt{S_0\log(p)/n}$ (if $\Psi_n = 0$), or $\sqrt{S_1\log(p)/n}$ (if $\Psi_n = 1$). The procedure is explained in Algorithm 1.

Algorithm 1 Two sparsity indexes confidence set

Parameters: $\delta$, $S_0$, $S_1$, $\sigma^2$
set the following quantities, computed on the first half of the samples only, as
    $\hat B^2 := \frac{3}{2}\big(n^{-1}\sum_{i \le n} Y_i^2\,(1 + 2\log(1/\delta)) + 2\log(1/\delta)\big)$,
    $\tau_n := 14|\hat B|\sqrt{n^{-1/2}\log(1/\delta)} + 381|\hat B|\sqrt{S_0\log(p/\delta)/n}$,
    $\tau'_n := 330|\hat B|\sqrt{S_1\log(p/\delta)/n}$,
  and let $\hat\theta$ be the lasso estimate as in Theorem 4.1. All these quantities are computed on the first half of the sample, and from now on we only use the second half of the sample.
set the residual $r = Y - X\hat\theta$
set the test statistic $R_n = \|r\|_2^2 - n\sigma^2$
set the test $\Psi_n = 1 - \mathbf{1}\{R_n \le \tau_n^2\}\,\mathbf{1}\{\sum_{j=S_0+1}^{p}\hat\theta_{(j)}^2 \le (\tau'_n)^2\}$, where $\hat\theta_{(\cdot)}$ is the ordered version of $|\hat\theta|$
set the confidence interval $C_n := \{u \in \mathcal{S}_{S_1} : \|u - \hat\theta\|_2 \le r_n\}$, where $r_n = 650\sqrt{S_0\log(p)/n}$ if $\Psi_n = 0$, and $r_n$ is of order $\sqrt{S_1\log(p)/n}$ if $\Psi_n = 1$
return $C_n$

The parameter $\sigma^2$ can be replaced by a consistent estimator of the variance of the noise $\varepsilon$ (e.g. a bootstrap estimate, a cross-validation estimate, etc.).
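The steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the lasso is solved here by plain proximal gradient (ISTA), which is one standard solver among others; the thresholds `tau`, `tau_prime` and the two radii are passed in by the caller instead of being hard-coded; the residual statistic follows the algorithm as printed, $R_n = \|r\|_2^2 - n\sigma^2$; all function names are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding, the proximal map of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, Y, lam, n_iter=500):
    """Minimise ||Y - X u||_2^2 + lam * ||u||_1 by proximal gradient (ISTA).

    Assumes X is not the zero matrix, so that the step size is well defined.
    """
    L = 2.0 * np.linalg.norm(X, 2) ** 2      # Lipschitz constant of the gradient
    u = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ u - Y)       # gradient of the quadratic term
        u = soft_threshold(u - grad / L, lam / L)
    return u

def two_sparsity_test(theta_hat, X2, Y2, sigma2, S0, tau, tau_prime):
    """Return Psi_n: 0 if both checks pass (S0 regime), 1 otherwise.

    X2, Y2 are meant to be the second half of the sample; theta_hat is
    meant to be computed on the first half.
    """
    r = Y2 - X2 @ theta_hat
    R_n = float(r @ r) - len(Y2) * sigma2                 # residual statistic
    tail = float(np.sort(theta_hat ** 2)[::-1][S0:].sum())  # mass beyond S0 largest
    accepted = (R_n <= tau ** 2) and (tail <= tau_prime ** 2)
    return 0 if accepted else 1

def confidence_radius(psi, radius_S0, radius_S1):
    """Radius of the l2 ball around theta_hat selected by the test."""
    return radius_S0 if psi == 0 else radius_S1
```

Each step costs a few matrix-vector products, which is consistent with the order $np$ complexity discussed in the text once the estimate $\hat\theta$ is available.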

Main result The following theorem states that this confidence interval is adaptive and honest (in the sense of Definition 3.2) over a large model $\mathcal{P}$.

Theorem 5.1. Assume that the noise is either Gaussian of variance less than 1, or bounded by 1, and assume that the assumptions for convergence of the adaptive lasso, stated in Theorem 4.1, are satisfied, and that $S_1 \le \mathfrak{p}$. Then the confidence set presented in Algorithm 1 is $\delta$-adaptive and honest for $\mathcal{I} = \{S_0, S_1\}$ and over the model

$$\mathcal{P}(\rho_n) = \mathcal{S}_{S_0}(32, \infty, \mathfrak{p}, \delta) \cup \tilde{\mathcal{S}}_{S_1}(32, \infty, \mathfrak{p}, \delta, \rho_n),$$

where

$$\rho_n = |B|\min\Big(54\sqrt{\log(1/\delta)}\, n^{-1/4},\ 460\sqrt{\frac{S_1\log(p/\delta)}{n}}\Big).$$

By definition of the enlarged set, this implies in particular that the confidence set is $\delta$-adaptive and honest for $\mathcal{I} = \{S_0, S_1\}$ and over the model

$$\mathcal{P}(\tilde\rho_n) = \ell_0(S_0) \cup \tilde\ell_0(S_1, \tilde\rho_n), \quad \text{with } \tilde\rho_n = 2\rho_n.$$

This theorem is actually a corollary of a more general result, presented in the Supplementary Material, Appendix A. The proof of this theorem is in the Supplementary Material (Appendix C).

The confidence set is adaptive and honest under the same assumptions that ensure consistency of the lasso estimate. This is quite reasonable, since the creation of adaptive and honest confidence sets is a strictly more difficult problem than the problem of estimating the parameter (indeed, any point of the confidence set is a good estimate of the parameter). Also, since $\lim_{n\to\infty}\rho_n = 0$, for any $\theta$ and for $n$ large enough, the confidence set contains $\theta$ with high probability, and its diameter adapts to the sparsity of $\theta$. It is adaptive to the two sparsities $\{S_0, S_1\}$ only and not to the whole spectrum of sparsities, but it already allows one to improve many existing learning algorithms by reducing the size of the confidence interval (by not always setting it to $\sqrt{S_1\log(p)/n}$ independently of $\theta$). Moreover, its computational complexity is of order $np$, which is linear.


Comparison with results in the paper [Nickl and van de Geer(2013)]. Our results imply all the upper bounds of [Nickl and van de Geer(2013)] in all cases (i), (ii) and (iii) of Table 1, i.e. imply the upper bounds on exactly sparse sets (this is illustrated in Theorem 5.1). Also, our confidence set is adaptive and honest in all three cases (i), (ii) and (iii), and we do not need to change the construction method as in the paper [Nickl and van de Geer(2013)]. Our assumptions on the design $X$ are weaker than in the paper [Nickl and van de Geer(2013)], where the authors consider a Gaussian design, which a fortiori satisfies Assumption 3.2 with high probability. Finally, the confidence set is, as we saw, computationally feasible, since its computational complexity is of order $np$. As mentioned in the introduction, the procedure in the paper [Nickl and van de Geer(2013)] boils down to minimizing a quadratic quantity over the set of $S_0$-sparse vectors, which has complexity of order $p^{S_0}n$. This implies that our procedure is computationally efficient on a set that is as large as possible in a minimax sense, as illustrated by the lower bounds in Table 1.


5.2 Adaptive and honest confidence sets for multiple 
sparsities 

Construction of the confidence set In the last subsection, we restricted ourselves to constructing a confidence set that is adaptive to only two sparsities $S_0$, $S_1$. Although it is already useful with respect to existing techniques that are not adaptive at all, it is only a first step toward a more global result where all sparsity indexes $\mathcal{I} = \{1, \ldots, \mathfrak{p}\}$ are considered. Algorithm 2 solves this problem.

Algorithm 2 Multi sparsity indexes confidence set, second version

Parameters: $\delta$, $\sigma^2$
set, using only the first half of the $2n$ samples,
    $\hat B^2 := \frac{3}{2}\big(n^{-1}\sum_{i \le n} Y_i^2\,(1 + 2\log(1/\delta)) + 2\log(1/\delta)\big)$,
    $\tau_n(S) := 14|\hat B|\sqrt{n^{-1/2}\log(1/\delta)} + 381|\hat B|\sqrt{S\log(p/\delta)/n}$,
    $\tau'_n(S) := 330|\hat B|\sqrt{(S+1)\log(p/\delta)/n}$,
  and let $\hat\theta$ be the lasso estimate as in Theorem 4.1
set the residual $r = Y - X\hat\theta$
set the statistic $R_n = \|r\|_2^2 - n\sigma^2$
set for any $S \le \mathfrak{p}$ the statistic $R'_n(S) := \sum_{j=S+1}^{p}\hat\theta_{(j)}^2$, where $\hat\theta_{(\cdot)}$ is the ordered version of $|\hat\theta|$
set $\hat S$ as the smallest $S \le \mathfrak{p}$ such that $R_n \le \tau_n(S)^2$ and $R'_n(S) \le (\tau'_n(S))^2$
set the confidence interval $C_n := \{u \in \mathbb{R}^p : \|u - \hat\theta\|_2 \le 650\sqrt{\hat S\log(p/\delta)/n}\}$
return $C_n$


The parameter $\sigma^2$ can be replaced by a consistent estimator of the variance of the noise $\varepsilon$ (e.g. a bootstrap estimate, a cross-validation estimate, etc.).
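The selection of $\hat S$ in Algorithm 2 amounts to a single pass over the candidate sparsities once the tail sums of the ordered squared coefficients are precomputed. The sketch below illustrates this step only. The thresholds are supplied by the caller as functions of $S$, the residual statistic `R_n` is assumed to have been computed beforehand, and returning `S_max` when no candidate passes is an assumption of the sketch (the algorithm as stated does not cover that case). The function name is hypothetical.

```python
import numpy as np

def select_sparsity(theta_hat, R_n, tau, tau_prime, S_max):
    """Smallest S in {1, ..., S_max} passing both checks of Algorithm 2.

    tau and tau_prime are callables S -> threshold; S_max must not exceed
    the dimension of theta_hat.
    """
    sq = np.sort(np.asarray(theta_hat, dtype=float) ** 2)[::-1]
    # tails[S] = sum of the squared entries beyond the S largest ones
    tails = np.concatenate([np.cumsum(sq[::-1])[::-1], [0.0]])
    for S in range(1, S_max + 1):
        if R_n <= tau(S) ** 2 and tails[S] <= tau_prime(S) ** 2:
            return S
    return S_max  # fallback chosen for this sketch only
```

Sorting dominates the cost of this step, so it adds only $O(p\log p)$ operations on top of computing the residual.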

The following theorem holds in this case (it is a direct consequence of Theorem 5.1).

Theorem 5.2. Assume that the noise is either Gaussian of variance less than 1, or bounded by 1, and assume that the assumptions for convergence of the adaptive lasso, stated in Theorem 4.1, are satisfied. Then the confidence set presented in Algorithm 2 is $\delta$-adaptive and honest for $\mathcal{I} = \{1, \ldots, \mathfrak{p}\}$ and over the model

$$\mathcal{P} := \mathcal{S}_1(32, \infty, \mathfrak{p}, \delta) \cup \bigcup_{S=2}^{\mathfrak{p}} \tilde{\mathcal{S}}_{S, S-1}(32, \infty, \mathfrak{p}, \delta, \rho_n(S)),$$

where

$$\rho_n(S) = |B|\min\Big(50\sqrt{\log(1/\delta)}\, n^{-1/4},\ 460\sqrt{\frac{S\log(p/\delta)}{n}}\Big).$$

A more general procedure is presented in the Supplementary Material, Appendix A.

Discussion This result is minimax optimal in view of the lower bounds in Table 1, and it is also computationally feasible. The resulting confidence interval is adaptive and honest for all indexes $\mathcal{I}$ over $\mathcal{P}$. Moreover, $\mathcal{P}$ is significantly larger than the set of "detectable" parameters such that all non-zero components are larger than $\sqrt{\log(p/\delta)/n}$. For this reason, this method is more efficient than the naive method of counting the number of non-zero entries in a thresholded adaptive estimate, and using this number for constructing the confidence set.

6 Experimental results 

In this section, we present some simulations and also some 
applications on images. 

6.1 Simulations 

In order to illustrate the efficiency of our method, we apply it to simulated data. We consider a problem in dimension $p = 10000$, with $n = 1000$ (the sampling rate is 10%). Let $0 < S_0 < S_1$ be the two approximate sparsity levels. We define three types of distributions (priors) on the set of parameters $\theta$:

• 9 ~ 0i : (i) So random coordinates of 9 are Af(0, 1) 
and (ii) the remaining coordinates are A/"(0, <Jg) where 
erg = s ° . With high probability, 9 G 

Ss 0 {C,B,p , 5 ). 

• 9 ~ 02 : (i) Si random coordinates of 9 are Af( 0,1) 
and (ii) the remaining coordinates are Af( 0, <Jq). With 
high probability, 9 G <S Sl (C, B,p, <5, p 2 n ) where p^ = 
Q( S 1 log(p) I So log(p) \ 

^ v n ' n /' 

• 9 ~ © 3 : (i) S 0 random coordinates of 9 are Af(0, 1) 
and (ii) the remaining coordinates are Af(0, erf) with 

= C + ^ M ). With high probability, 

9 G S Sl (C,B,p,6,pl), where p 2 = 0(n -1 / 2 + 
Solog(p) \ 
n /' 

See Figure 1 for an illustration of this.

Figure 1: Mean of the square of the re-ordered coordinates of $\theta$ sampled according to priors $\Theta_1$, $\Theta_2$, and $\Theta_3$.

The distributions $\Theta_2$ and $\Theta_3$ correspond to two extremal cases in which the sampled vector $\theta$ is not $S_0$ approximately sparse, i.e. either the norm of the tail coefficients is large, or the number of detectable coefficients is larger than $S_0$. For a given $\theta \sim \Theta_k$ ($k \in \{1, 2, 3\}$), we write $\Psi$ for its "class", i.e. the hypothesis of the test to which it belongs with high probability. This means that $\Psi = 0$ for $\theta \sim \Theta_1$ and $\Psi = 1$ for $\theta \sim \Theta_2$ or $\theta \sim \Theta_3$.

For all distributions $\Theta_k$, we do 10000 experiments (corresponding to trying our method over 10000 samples of $\theta$ sampled according to $\Theta_k$) where we perform the test we described in Section 5 for the testing problem (C.1), infer the sparsity, compute the adaptive confidence interval, and compute the risk of the estimate. The sampling matrix $X$ is composed of Gaussian random variables, and the noise $\varepsilon$ is i.i.d. Gaussian with variance 1. In this design, $\mathfrak{p} = O(n/\log(p))$. We compute an estimate of $\theta$ via hard thresholding, which happens in this orthogonal setting to be equivalent to lasso, on the first half of the samples. We then construct the test on the second half of the sample. We summarise the results in Table 2.

A first general remark is that the test we consider manages to distinguish efficiently between $H_0$ and $H_1$, for many different configurations of sparsity. The adaptive and honest confidence sets we build using this test are also quite efficient. The probability that the true parameter belongs to the adaptive confidence set is very close to the probability of correctly inferring the class of $\theta$. The strength of these sets is to be adaptive to the sparsity of the problem, i.e. they contain $\theta$ with high probability and their width varies with the complexity of the true parameter ($S_0$ or $S_1$). As a matter of fact, the width of the adaptive confidence set is close to the value of the risk of the adaptive estimate, which is exactly what is wanted. This is particularly interesting since, as expected, the risk is much larger under $H_1$ than under $H_0$.

In the case of distribution $\Theta_3$, it is interesting to remark that although the sparsity of $\theta$ is close to $S_0$ in expectation, this does not prevent our test from efficiently classifying it as $H_1$. This is actually quite important in terms of confidence sets since we can observe that, for each configuration of sparsity, the risk, and thus the width of the adaptive confidence interval, is much larger for $\Theta_3$ than for $\Theta_1$. A test only based on the inferred sparsity (i.e. on $\|\hat\theta\|_0$) would not have detected this, since the inferred sparsity is approximately the same in these two cases.

6.2 Application of the method to images 


We consider now a more concrete setting, where we apply
our method to images. We focus on black and white
drawings¹. The particularity of such images is that only a few

¹ Note that a pre-treatment, like differentiation, can be applied
to regular images to transform them into drawing-like images.
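| Sparsities | $S_1 = 5$ and $S_2 = 10$ | | | $S_1 = 5$ and $S_2 = 10^3$ | | | $S_1 = 50$ and $S_2 = 10^3$ | | |
|---|---|---|---|---|---|---|---|---|---|
| Prior | $\Theta_1$ | $\Theta_2$ | $\Theta_3$ | $\Theta_1$ | $\Theta_2$ | $\Theta_3$ | $\Theta_1$ | $\Theta_2$ | $\Theta_3$ |
| $P_\theta(\Psi_n \neq \Psi)$ | $5 \cdot 10^{-?}$ | $1 \cdot 10^{-2}$ | X | $1 \cdot 10^{-2}$ | $8 \cdot 10^{-?}$ | $8 \cdot 10^{-2}$ | $1 \cdot 10^{-3}$ | $2 \cdot 10^{-?}$ | $6 \cdot 10^{-3}$ |
| $E\|\hat\theta\|_0$ | 4.8 | 9.7 | X | 4.8 | $2.1 \cdot 10^{3}$ | 4.8 | 48.7 | $2.3 \cdot 10^{3}$ | 49.3 |
| $P_\theta(\theta \notin C_n)$ | $1.2 \cdot 10^{-1}$ | $9 \cdot 10^{-2}$ | X | $5 \cdot 10^{-2}$ | $9 \cdot 10^{-?}$ | $8.1 \cdot 10^{-2}$ | $3.2 \cdot 10^{-3}$ | $4 \cdot 10^{-5}$ | $7.8 \cdot 10^{-3}$ |
| $E|C_n|$ | $6.16 \cdot 10^{-2}$ | $1.8 \cdot 10^{-1}$ | X | $7.76 \cdot 10^{-2}$ | $9.8 \cdot 10^{3}$ | $5.7 \cdot 10^{-1}$ | 27.6 | $9.9 \cdot 10^{3}$ | $9.7 \cdot 10^{3}$ |
| $E\|\hat\theta - \theta\|_2$ | $3.37 \cdot 10^{-2}$ | $1.1 \cdot 10^{-1}$ | X | $3.7 \cdot 10^{-2}$ | $9.6 \cdot 10^{3}$ | $3.2 \cdot 10^{-1}$ | 15.04 | $9.7 \cdot 10^{3}$ | $9.4 \cdot 10^{3}$ |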

Table 2: Expected results for the test, risk and adaptive and honest confidence set for the three priors. 



Figure 2: First column = original image. Second column= 
reconstructed image. Third column: extremal point of the 
confidence set that minimises the contrast. The test of 3%p 
sparsity is accepted for the first image but rejected for the 
other. 


pixels are non-zero, i.e. if we align the columns of such an
image, it can be considered as a sparse vector $\theta$ of dimension
$p$, where $p$ is the number of lines times the number of
columns of the image. Such an image $\theta$ can easily be compressed
by keeping $n$ Fourier coefficients of this vector,
chosen at random frequencies. We write $X$ for the $n \times p$ matrix
that represents this convolution. The model is then again
$Y = X\theta + e$, where $e$ is some noise on the compression,
due for instance to transmission. An interesting question
when observing the compression $Y$ of an image is to infer
the quality of the reconstruction, i.e. whether the image is
indeed quite sparse or not.
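As a rough illustration of this compression step, the sketch below computes $n$ discrete Fourier coefficients of a sparse vector at randomly chosen frequencies. The function name, the seed and the use of a plain (unnormalised) discrete Fourier transform are assumptions made for the example; the noise $e$ is omitted and could be added to each coefficient.

```python
import cmath
import random


def random_fourier_measurements(theta, n, seed=0):
    """Return n distinct random frequencies and the matching DFT coefficients.

    theta is the flattened image; only its non-zero entries contribute,
    so the cost is (number of non-zero pixels) x n.
    """
    p = len(theta)
    rng = random.Random(seed)
    freqs = rng.sample(range(p), n)  # n distinct frequencies in {0, ..., p-1}
    Y = []
    for f in freqs:
        coeff = sum(t * cmath.exp(-2j * cmath.pi * f * k / p)
                    for k, t in enumerate(theta) if t != 0.0)
        Y.append(coeff)
    return freqs, Y


if __name__ == "__main__":
    image = [0.0] * 16
    image[3] = 2.0  # a single non-zero pixel
    freqs, Y = random_fourier_measurements(image, 4)
    print(freqs, [round(abs(y), 6) for y in Y])
```

Each row of $X$ then corresponds to one sampled frequency, so $X$ never needs to be stored explicitly.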

When one then observes such a compression, it is possible
to reconstruct the image $\theta$ as we explained in Section [?].
Also, the test and the confidence sets are built in the same
way as in Section [?].

We consider here an image with $p = 7200000$ pixels. We
consider $n = 5\%p = 36000$ Fourier coefficients obtained
by FFT (Fast Fourier Transform). We consider 2 images
that we display in Figure 2 (first column), and write $\theta^{(k)}$
where $k \in \{1,2\}$. The first image is a black and white
drawing of a cathedral, and the second one is the same
drawing but with a background (a cloud): the first one
will be approximately sparse and the second one not. It is
possible to compress these images by considering, instead
of $\theta^{(k)}$, the vector $Y^{(k)}$. The quality of the compression,
i.e. the proximity between the image reconstructed through
$Y^{(k)}$ and the image $\theta^{(k)}$, will however depend very much
on the sparsity of the image, as we saw in Theorem 4.1. We
can use the results of Section 5 to test whether the image
is at least $3\%p$ approximately sparse or not, and then build
confidence sets around it. Figure 2 (two last columns, the
first one containing the reconstructed images and the second
one an extremal point of the confidence sets) illustrates
this. More precisely, we display, for each image, the estimate
of $\theta^{(k)}$ (i.e. the reconstructed image), and an extremal
point of the confidence set, which we choose as the
image that minimises the contrast.


Although images 1 and 2 are different, their estimates
are very close. The test reveals the fact that they are
not the same, and that in particular the reconstruction of
image 1 will be good while the reconstruction of image 2 will
be bad (although they seem similar from their reconstruction).
The confidence sets also show how much the true
image could actually differ from the reconstructed
image. In particular, the extremal point of the confidence
set that minimises the contrast implies that although it is
rather unlikely that there is a background in image 1,
image 2 might well have one. For these images, the notion
of approximate sparsity is very important since even the
first image is not at all sparse (not even $10\%p$ sparse). It
is however less than $3\%p$ approximately sparse. Because
of the background, however, the second image is not even
close to $3\%p$ approximately sparse.


Conclusion. In this paper, we developed a computationally
feasible, adaptive and honest confidence interval, first
in the two sparsity indexes case, and then in the general
setting of multiple sparsity indexes. The method we propose
is efficient on a maximal set (in a minimax-optimal sense),
and is implementable, which is a novelty with respect to
the existing results. The assumptions we make are also less
restrictive than what was previously required. We also
provided an experimental validation of these results by
simulations, and an application to images.
Supplementary material for the paper: “Implementable
confidence sets in high dimensional regression”

A General construction of the confidence sets 

A.1 General construction of a confidence set in the two point case 

Construction of the confidence set. Let $C, B, c, \sigma^2, p > 0$ and $\delta > 0$. Let $S_0 < S_1 < p$. The algorithm for the two
sparsity indexes, Algorithm 3, contains two main steps. The first step is to construct a test for deciding whether $\theta$ is
$S_0$ approximately sparse, or $S_1$ approximately sparse. The test consists in first computing an adaptive estimate $\hat\theta$ of $\theta$, and
then thresholding all non-significant components. Then, the testing decision is based on two factors: (i) testing whether
the number of non-zero entries of the thresholded estimate is larger than $S_0$ and (ii) testing whether the squared residuals
$\|\hat r\|_2^2$ are significant or not. If both these quantities are small enough, the test is accepted, otherwise it is rejected. The
outcome of this is the test $\Psi_n$. The second step is to use this information to construct the confidence set $C_n$. Based on $\Psi_n$,
the confidence interval $C_n$ will be of diameter of order $\sqrt{S_0\log(p)/n}$ (if $\Psi_n = 0$), or $\sqrt{S_1\log(p)/n}$ (if $\Psi_n = 1$). The procedure
is explained in Algorithm 3.


Algorithm 3 Two sparsity indexes confidence set 


Parameters: $C, B, p, \delta, \sigma, c_m, C_M$

set $\hat\theta$ = an adaptive estimate of $\theta$ as in Theorem 4.1 or Theorem D.1, computed on the first half of the sample only

Consider from now on only the second half of the sample ($X$, $Y$ := second half of the sample)
set the constants r„ = 3,/ . , c n n" 1 / 2 log(l/A) + 2,/ ( c ™ +1 \ E S 0 io g ( P /s) and , = 2 fZ 

n y min(c m ,l) / / 1 y min(c m ,l) n ’ n y mi 

(l2(36, 


E := 


36) 2 + C 2 ) if (9 is the estimate from Theorem 


set the residual 


4.1 


min(c- 

otherwise E as in Theorem 


J5 Si log(p/<5) 


. 4 ) 


, where 


D.l 


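set the residual $\hat r = Y - X\hat\theta$

set the test statistic

$$R_n = \frac{1}{n}\|\hat r\|_2^2 - \sigma^2$$

set the test

$$\Psi_n = 1 - \mathbf{1}\{R_n < \tau_n^2\}\,\mathbf{1}\Big\{\sum_{j=S_0+1}^{p} \hat\theta_{(j)}^2 < (t_n)^2\Big\} \quad (A.1)$$

where $\hat\theta_{(\cdot)}$ is the ordered version of $|\hat\theta|$

set the confidence interval

$$C_n := \Big\{u \in \mathcal S_{S_1} : \|u - \hat\theta\|_2 \le \sqrt{E\Big(\frac{S_1\log(p)}{n}\mathbf{1}\{\Psi_n = 1\} + \frac{S_0\log(p)}{n}\mathbf{1}\{\Psi_n = 0\}\Big)}\Big\}$$

return $C_n$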
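The decision rule of Algorithm 3 can be sketched as follows. This is a hedged illustration, not the author's implementation: the thresholds `tau2` and `t2` and the constant `E` are taken as inputs rather than computed from the RIP constants, the residual statistic is normalised by the sample size (an assumption, made to match the bounds in Equation (C.7)), and the function names are hypothetical.

```python
import math


def two_sparsity_test(theta_hat, residual, sigma2, S0, tau2, t2):
    """Return 0 if the S0-sparse hypothesis is accepted, 1 otherwise.

    tau2 bounds the residual statistic and t2 bounds the squared mass of
    theta_hat outside its S0 largest entries; both are supplied by the caller.
    """
    n = len(residual)
    R_n = sum(r * r for r in residual) / n - sigma2
    ordered_sq = sorted((v * v for v in theta_hat), reverse=True)
    tail = sum(ordered_sq[S0:])  # entries ranked S0+1, ..., p
    accepted = (R_n < tau2) and (tail < t2)
    return 0 if accepted else 1


def confidence_radius(psi, S0, S1, p, n, E):
    """l2 radius of the confidence set around theta_hat, given the test outcome."""
    S = S1 if psi == 1 else S0
    return math.sqrt(E * S * math.log(p) / n)


if __name__ == "__main__":
    psi = two_sparsity_test([3.0, 0.0, 0.0, 0.0], [1.0, -1.0, 1.0, -1.0],
                            sigma2=1.0, S0=1, tau2=0.5, t2=0.1)
    print(psi, confidence_radius(psi, S0=1, S1=2, p=100, n=100, E=1.0))
```

Both conditions must hold for acceptance, so a large residual or a heavy tail of the estimate is enough to switch to the wider set.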


Main result. The following theorem states that this confidence interval is adaptive and honest (in the sense of
Definition 3.2) for $\rho_n$ large enough.

Theorem A.1. Let Assumptions 3.1 and 3.2 be verified for $p > S_1 > S_0 > 0$, and for constants $c, c_m, C_M > 0$. Let $\delta > 0$
and $B > 0$ and $C \ge 16C'(B^2 + c^2)$ where $C'$ is some universal constant. Assume that

$$\rho_n \ge \frac{3(C_M + 1)}{\sqrt{\min(c_m, 1)}} \times \min\Big(\sqrt{C}\,\log(1/\delta)\,n^{-1/4},\ \sqrt{\frac{E S_1\log(p/\delta)}{n}}\Big)$$

if $S_0 < n^{1/2}\log(1/\delta)/\log(p/\delta)$, and $\rho_n = 0$ otherwise. Set

$$\mathcal P := \mathcal P_n(\rho_n) = \mathcal S_{S_0}(C, B, p, \delta) \cup \mathcal S_{S_1}(C, B, p, \delta, \rho_n).$$

The confidence set $C_n$ constructed in Algorithm 3 is $\delta$-adaptive and honest for $\mathcal I = \{S_0, S_1\}$ and $\mathcal P$.
The proof of this theorem is in the supplementary material (Appendix C).


Discussion. Comparison with the results in [Nickl and van de Geer(2013)]. The results we obtain hold uniformly on
any vector of $\mathcal S_{S_0}(C, B, p, \delta)$ and $\mathcal S_{S_1}(C, B, p, \delta, \rho_n)$ for $\rho_n$ as in Theorem 5.1. Let $\ell_2(B)$ be the $\ell_2$ ball of radius $B$. It
holds by definition of these sets that

$$\ell_0(S_0) \cap \ell_2(B) \subset \mathcal S_{S_0}(C, B, p, \delta),$$

and

$$\ell_0(S_1, (1 + C)\rho_n) \cap \ell_2(B) \subset \mathcal S_{S_1}(C, B, p, \delta, \rho_n).$$

In particular, our results imply all the upper bounds of [Nickl and van de Geer(2013)] in all cases (i), (ii) and (iii)
of Table 1, i.e. imply the upper bounds on exactly sparse sets (this is illustrated in Theorem [?]). Also, our confidence
set is adaptive and honest in all three cases (i), (ii) and (iii), and we do not need to change the construction
method as in [Nickl and van de Geer(2013)]. Our assumptions on the design of $X$ are weaker than in
[Nickl and van de Geer(2013)], where the authors consider Gaussian design, which a fortiori satisfies Assumption 3.2
with high probability. Finally, the confidence set is, as we saw, computationally feasible, since its computational complexity
is of order $np$. As mentioned in the introduction, the procedure in [Nickl and van de Geer(2013)] boils down
to minimizing a quadratic quantity over the set of $S_0$-sparse vectors, which has complexity of order $p^{S_0} n$. This implies
that our procedure is computationally efficient on a set that is as large as possible in a minimax sense, as illustrated by the
lower bounds in Figure 1.


The parameters of the algorithm. The definition of the test $\Psi_n$ uses the knowledge of the RIP constants of the matrix $X$, the
input of the parameters $C, B, p, \delta$ of the enlarged sets, and the knowledge of the variance $\sigma^2$ of the noise. The dependency
on $\sigma^2$ is undesirable, but there is a way around it by dividing the second half of the dataset in 2 parts, and using the first part
to estimate $\sigma^2$, and the second part to construct the confidence set (plugging in the estimate of $\sigma^2$ instead of the true value).
This is explained in Corollary 5.1. The parameters $B, C, p, \delta$ of the enlarged sets are left to the user to choose, for
flexibility. The parameter $\delta$ is the desired coverage of the confidence set, the parameters $C, p$ calibrate the desired level of
approximate sparsity, and the parameter $B$ is an upper bound on the $\ell_2$ norm of $\theta$. It is reasonable that the user fixes $\delta$, as well
as the parameters $C, p$. For instance, taking any fixed constant $C > 0$ and setting $p = 0$ will already allow to deal with all
parameters in $\ell_0(S)$, see Corollary 5.1, where we choose $C$ large enough ($C = 32$) so that the assumptions of Theorem A.1
are satisfied, and where we set $p = 0$ in the second part of the corollary. The parameter $B^2$, i.e. an upper bound on $\|\theta\|_2^2$,
can be estimated with high enough precision by $(\min(1, c_m))^{-1}\big(n^{-1}\sum_{i \le n} Y_i^2\,(1 + 2\log(1/\delta)) + 2\log(1/\delta)\big)$, if the
noise is either bounded by 1, or is a Gaussian of variance 1. The RIP constants of the matrix $X$, namely $c_m$ and $C_M$, are not
computable in finite time. However, it is not necessary for the good functioning of the algorithm to have an exact knowledge
of these constants. An upper bound on $C_M$ and a lower bound on $c_m$ suffice, and this will only damage the performance
of the algorithm by a constant factor with respect to what could be achievable if $c_m, C_M$ were known precisely. Moreover,
if $c_m$ is too small and $C_M$ is too large, it means that the design is very correlated, and in this case, all known computable
estimates fail. The lasso, for instance, works only for a quite restricted range of values for the parameters $c_m$ and $C_M$ (in
this paper, we reprove the lasso consistency in the enlarged sets, and obtain that it functions correctly for $c_m > 2/3$ and
$C_M < 4/3$, see Theorem 4.1; this can be slightly improved, but not by much). Taking that into account, it is hopeless
to expect a confidence set that will be at the same time computable and adaptive and honest for all values of $c_m, C_M$.
Indeed, as pointed out above, any point of an adaptive and honest confidence set is actually a minimax adaptive estimator.
In Corollary 5.1 we specify all values of the parameters for the construction of the confidence set, under the assumption
that the lasso estimate is efficient (under the assumptions of Theorem 4.1): in this case the values of the parameters are
fully explicit and the confidence set can be computed without a priori knowledge on the problem.


A.2 General construction for multiple sparsities 

Construction of the confidence set. In the last subsection, we restricted ourselves to constructing a confidence set that is
adaptive to only two sparsities $S_0, S_1$. Although it is already useful with respect to existing techniques that are not adaptive
at all, it is only a first step toward a more global result where many sparsity indexes $\mathcal I$ are considered (potentially, $\mathcal I$ can be
the whole set of indexes $\{1, \ldots, p\}$).

Let $\mathcal I = \{i_1, \ldots, i_l\}$, where $i_1 < i_2 < \cdots < i_l$, be the set of sparsities over which one wishes to adapt. An intuitive
extension of the two sparsity indexes confidence set to this more complex setting is presented in Algorithm 4. The idea
is to compute the tests $\Psi_n$ for the adjacent sparsity indexes in $\mathcal I$, for smaller and smaller sparsities, until the sparsity
index $t$ where the test is rejected ($i_{t-1}$ is rejected). The confidence band considered is then chosen of diameter of order
$\sqrt{i_t\log(p)/n}$, and is centred at an adaptive estimate $\hat\theta$.


Algorithm 4 Multi sparsity indexes confidence set

Set $\Psi_n := 0$, $t := l$

while $\Psi_n = 0$ do

Set $\Psi_n$ and $C_n$ as the test and confidence interval output by Algorithm 3 with parameters $S_0 := i_{t-1}$, $S_1 := i_t$

end while

return $C_n$
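The loop of Algorithm 4 can be sketched as below. The callback `run_two_point` stands for Algorithm 3 and is a hypothetical interface; the explicit decrement of $t$ and the stopping guard at the smallest sparsity index are written out here as one reading of "smaller and smaller sparsities", since a loop needs both to terminate.

```python
def multi_sparsity_confidence_set(sparsities, run_two_point):
    """Scan adjacent sparsity indexes from the largest pair downwards.

    sparsities    : increasing list [i_1, ..., i_l]
    run_two_point : callable (S0, S1) -> (psi, conf_set), standing for the
                    two-point procedure; psi = 1 means the S0 hypothesis is rejected
    Returns the confidence set produced by the last two-point call.
    """
    t = len(sparsities) - 1  # 0-based position of i_l
    psi, conf_set = 0, None
    while psi == 0 and t >= 1:
        psi, conf_set = run_two_point(sparsities[t - 1], sparsities[t])
        t -= 1
    return conf_set


if __name__ == "__main__":
    # Toy two-point procedure: reject S0 whenever S0 is below a "true" level 7,
    # and report the sparsity level that drives the radius.
    def toy_two_point(S0, S1):
        psi = 1 if S0 < 7 else 0
        return psi, ("radius driven by", S1 if psi == 1 else S0)

    print(multi_sparsity_confidence_set([2, 5, 10, 20], toy_two_point))
```

At most $l - 1$ two-point tests are run, which is why a union bound over the tests is enough to control the overall error.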


Main result. This confidence set will be adaptive and honest for a model that respects the restriction of all sparsity indexes
in $\mathcal I$.


Theorem A.2. Let Assumptions 3.1 and 3.2 be satisfied for $p > 0$, and for constants $c, c_m, C_M > 0$. Let $\delta > 0$ and $B > 0$
and $C \ge 16C'(B^2 + c^2)$ where $C'$ is some universal constant. Let, for any $2 \le t \le l$,

$$\rho_n(t) \ge \frac{3(C_M + 1)}{\sqrt{\min(c_m, 1)}} \times \min\Big(\sqrt{C}\,\log(1/\delta)\,n^{-1/4},\ \sqrt{\frac{E\, i_t\log(p/\delta)}{n}}\Big),$$

if $i_{t-1} < n^{1/2}\log(1/\delta)/\log(p/\delta)$, and $\rho_n(t) = 0$ otherwise. Set

$$\mathcal P := \mathcal S_{i_1}(C, B, p, \delta) \cup \bigcup_{t=2}^{l} \mathcal S_{i_t}(C, B, p, \delta, \rho_n(t)).$$

The confidence set $C_n$ presented in Algorithm 4 is $\delta$-adaptive and honest for $\mathcal I$ and $\mathcal P$.

This theorem follows immediately from Theorem A.1 and a union bound over all tests.

B Proof of Theorem 4.1

Step 0: Compatibility condition. We state the compatibility condition for $X$, or restricted eigenvalue condition
(see [Bickel et. al.(2009)]).

Assumption B.1. Let $\phi > 0$ and $p > 0$. Let $U$ be any support of size $S \le p$ of $\{1, \ldots, p\}$, $u$ be any vector, and $u_U$ be the
restriction of $u$ to $U$. The $(p, \phi)$-compatibility condition is satisfied if for any such $U, u$ such that $\|u_{U^c}\|_1 \le 4\|u_U\|_1$,
where $U^c$ is the complement of $U$, it holds that

$$\|u_U\|_1^2 \le \frac{S\,\|Xu\|_2^2}{n\phi^2}.$$

Note that this assumption is implied by the R.I.P. property with not too large R.I.P. constant for $33p$ vectors, i.e. $c_m > 2/3$
and $C_M < 4/3$ in Assumption 3.2 with $33p$. To deduce this, combine Assumption 2 in [Bickel et. al.(2009)], with $m = 32p$
and $c_m > 2/3$ and $C_M < 4/3$, with the definition of $\kappa(p, m, 4)$ and Lemma 4.1 in [Bickel et. al.(2009)], to obtain that
this RIP property implies this compatibility condition with $\phi^2 = 1/6$. See also [Zou(2006)].

Since in the assumptions of Theorem 4.1 we assume Assumption 3.2 with $66p$, $c_m > 2/3$ and $C_M < 4/3$, we know from
the previous remark that Assumption B.1 holds with $2p$ and $\phi^2 = 1/6$.
Step 1: Decomposition of the problem. Let $\theta \in \mathcal S_S(C, B, p, \delta)$. Let $\theta = \theta_1 + \theta_2$, where $\theta_1$ contains the $S$ largest
components of $\theta$, and $\theta_2$ the rest. We have for any vector $u \in \mathbb R^p$

$$\|Y - Xu\|_2^2 = \|e + X\theta_2\|_2^2 + 2\langle e + X\theta_2, X(\theta_1 - u)\rangle + \|X(\theta_1 - u)\|_2^2$$

$$= v^2 + \|X(\theta_1 - u)\|_2^2 + 2\langle e, X(\theta_1 - u)\rangle + 2\langle X\theta_2, X(\theta_1 - u)\rangle, \quad (B.1)$$

where $v^2 = \|e + X\theta_2\|_2^2$, which does not depend on $u$.

Note first that

$$|\langle X\theta_2, X(\theta_1 - u)\rangle| \le \|X\theta_2\|_2\,\|X(\theta_1 - u)\|_2$$

$$\le C_M\sqrt{n}\,\|\theta_2\|_2\,\|X(\theta_1 - u)\|_2$$

$$\le C_M\sqrt{CS\log(p/\delta)}\,\|X(\theta_1 - u)\|_2, \quad (B.2)$$

by definition of $X$, since $\theta \in \mathcal S_S(C, B, p, \delta)$ and is thus $p$ sparse, and since $\theta_2$ contains all entries of $\theta \in \mathcal S_S(C, B, p, \delta)$
that are smaller than the $S$ largest (and thus, $\|\theta_2\|_2^2 \le C\frac{S\log(p/\delta)}{n}$).

We have, since $e$ is a centred sub-Gaussian process, and by Hölder’s inequality and a union bound, that with probability
larger than $1 - \delta$

$$|\langle e, X(\theta_1 - u)\rangle| \le \|X^T e\|_\infty\,\|\theta_1 - u\|_1 \le Dc\sqrt{\log(p/\delta)n}\,\|\theta_1 - u\|_1, \quad (B.3)$$

where $D$ is some universal constant. It holds since for any $j \le p$, with probability larger than $1 - \delta$, $\langle X_j, e\rangle \le
Dc\sqrt{\log(1/\delta)n}$ by Hölder’s inequality (since $e$ is sub-Gaussian). Then a union bound on all $j \le p$ leads to the bound (B.3).

Equations (B.1) to (B.3) imply that with probability larger than $1 - \delta$

$$\|Y - Xu\|_2^2 \ge v^2 + \|X(\theta_1 - u)\|_2^2 - Dc\sqrt{\log(p/\delta)n}\,\|\theta_1 - u\|_1 - C_M\sqrt{CS\log(p/\delta)}\,\|X(\theta_1 - u)\|_2. \quad (B.4)$$


Step 2: Lasso estimator. Define now the estimator $\hat\theta$ as being

$$\hat\theta = \arg\min_u \Big[\|Y - Xu\|_2^2 + \kappa\sqrt{\log(p/\delta)n}\,\|u\|_1\Big]$$


where k > max(2H, ^-A 2 ) where A := max(Bc, CmVC). 
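To make the penalised criterion concrete, the sketch below evaluates the lasso objective with penalty level $\kappa\sqrt{\log(p/\delta)n}$ and gives its closed-form minimiser in the special case of an identity design, where each coordinate is soft-thresholded at half the penalty level. The function names are hypothetical, $\kappa$ is left to the caller, and the identity design is an assumption made only so that the minimiser has a closed form.

```python
import math


def lasso_penalty_level(kappa, p, n, delta):
    """Penalty multiplying the l1 norm: kappa * sqrt(log(p / delta) * n)."""
    return kappa * math.sqrt(math.log(p / delta) * n)


def lasso_objective(Y, X, u, lam):
    """Squared residual norm plus lam times the l1 norm of u."""
    residual = [y - sum(x * v for x, v in zip(row, u)) for y, row in zip(Y, X)]
    return sum(r * r for r in residual) + lam * sum(abs(v) for v in u)


def soft_threshold(z, lam):
    """Minimiser over u of (z - u)^2 + lam * |u| (scalar problem)."""
    return math.copysign(max(abs(z) - lam / 2.0, 0.0), z)


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0]]  # identity design (assumption)
    Y = [3.0, 0.2]
    lam = 1.0
    u_hat = [soft_threshold(y, lam) for y in Y]
    print(u_hat, lasso_objective(Y, X, u_hat, lam))
```

For a general design no closed form is available and an iterative solver is needed; the objective function above can still be used to compare candidate solutions.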

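By definition of $\hat\theta$, we have, if we compare it to the point $u := \theta_1$, that

$$\|Y - X\hat\theta\|_2^2 + \kappa\sqrt{\log(p/\delta)n}\,\|\hat\theta\|_1 \le \|Y - X\theta_1\|_2^2 + \kappa\sqrt{\log(p/\delta)n}\,\|\theta_1\|_1.$$

From Equation (B.4) and Equation (B.1), this implies

$$v^2 + \|X(\theta_1 - \hat\theta)\|_2^2 - Dc\sqrt{\log(p/\delta)n}\,\|\theta_1 - \hat\theta\|_1 - C_M\sqrt{CS\log(p/\delta)}\,\|X(\theta_1 - \hat\theta)\|_2 + \kappa\sqrt{\log(p/\delta)n}\,\|\hat\theta\|_1$$

$$\le v^2 + \kappa\sqrt{\log(p/\delta)n}\,\|\theta_1\|_1,$$

which is equivalent to

$$\|X(\theta_1 - \hat\theta)\|_2^2 + \kappa\sqrt{\log(p/\delta)n}\,\|\hat\theta\|_1$$

$$\le Dc\sqrt{\log(p/\delta)n}\,\|\theta_1 - \hat\theta\|_1 + C_M\sqrt{CS\log(p/\delta)}\,\|X(\theta_1 - \hat\theta)\|_2 + \kappa\sqrt{\log(p/\delta)n}\,\|\theta_1\|_1.$$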

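This implies, since $A = \max(Dc, C_M\sqrt{C})$ and since $\theta_1$ is $S$-sparse, that with probability $1 - \delta$

$$\|X(\theta_1 - \hat\theta)\|_2^2 + \kappa\sqrt{\log(p/\delta)n}\,\big(\|\hat\theta_{\Theta^c}\|_1 + \|\hat\theta_\Theta\|_1\big)$$

$$\le \kappa\sqrt{\log(p/\delta)n}\,\|(\theta_1)_\Theta\|_1 + A\sqrt{\log(p/\delta)}\,\Big(\sqrt{n}\,\|(\theta_1)_\Theta - \hat\theta_\Theta\|_1 + \sqrt{n}\,\|\hat\theta_{\Theta^c}\|_1 + \sqrt{S}\,\|X(\theta_1 - \hat\theta)\|_2\Big),$$

where $\Theta$ is the support of $\theta_1$, $\Theta^c$ is its complement, and for any vector $u$, $u_\Theta$ is the restriction of $u$ to $\Theta$ (same applies to
the complement). This implies, since $\kappa \ge 2A$, that with probability $1 - \delta$

$$\|X(\theta_1 - \hat\theta)\|_2\Big(\|X(\theta_1 - \hat\theta)\|_2 - A\sqrt{\log(p/\delta)S}\Big) + \kappa/2\,\sqrt{\log(p/\delta)n}\,\|\hat\theta_{\Theta^c}\|_1$$

$$\le 3\kappa/2\,\sqrt{\log(p/\delta)n}\,\|(\theta_1)_\Theta - \hat\theta_\Theta\|_1. \quad (B.5)$$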
It holds with high probability that : || X{6\ — 0)|| 2 < ^i/log (p/6)nS. Assume that \\X(9 — $i)||| > 4A 2 S\og(p/S). 
In this case, the previous equation implies that with probability 1 — <5 


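$$\|X(\theta_1 - \hat\theta)\|_2^2/2 + \kappa/2\,\sqrt{\log(p/\delta)n}\,\|\hat\theta_{\Theta^c}\|_1 \le 3\kappa/2\,\sqrt{\log(p/\delta)n}\,\|(\theta_1)_\Theta - \hat\theta_\Theta\|_1,$$

which implies in particular

$$\|(\theta_1)_{\Theta^c} - \hat\theta_{\Theta^c}\|_1 \le 3\,\|(\theta_1)_\Theta - \hat\theta_\Theta\|_1, \quad (B.6)$$

since $(\theta_1)_{\Theta^c} = 0$. By the compatibility condition, we thus know that for this vector,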




n(jj 


,2 ’ 


which implies that with Equation (|B.6|) that 


\\X{e l -6)h<^s/\og{v/5)S, 
and so in every case, since > 4 A 2 , we have with probability 1 — <5 

\\x(e 1 -9)\\ 2 <^^/iog(p/s)s. 


(B.7) 


By combining this with Equation (|B.5|), we obtain 


3 kA 


\\ x (°i ~ 0)\\l + K/2^/\og(p/5)n\\eQc\\ 1 < 3«/2\/log(p/(5)n||(0i)e - 0©||i + — j-S\og(p/5). (B.8) 


Case 1: ||(0i)© — 0©||i < l °s(p/ s ) t This condition implies with Equation ( |B.8[ > that 

||X(0! - 0)||| + «/2v'log(p/<y)n||0i - 0||i 

= \\ x (Qi ~ 0)||| + K/2^\og(p/6)n\\(0 1 ) e - 0©||i + K/2^/\og(p/6)n\\0 e c\\ 1 

< K /2\/log(p/5)n||(0i)e - 0©||i 

_ A q k a 

+ 3K/2 x /log{p/6)n\\(6 1 )e - 0©||i + —— Slog(p/6) 

9 

_ A q k a 

< 2/c v /log(p/<5)ra||(0i)© - 0©||i + ——Slog(p/S) 

9 

< 6 ^ 4 5 , log(p/<5). 


Case 2: ||(0i)© — 0©||i > ^^.Sy log ^ s ' ) . This condition implies with Equation ( |B.8| > that 

||(0i)©c — 0©c||i < 4||(0 1 )e — 0©||i, 

since (0i)©c = 0. By the compatibility condition, we thus know that for this vector. 


|, rM M2 TO -0)||1 

ll^i)e ^ »e||i <-^-> 


which implies by Equation (|B.7|) that 


IKMe-Sel 


and which implies together with Equation m that with probability 1 — 5, 

'3 k 2 3/o 


||X(6»i - 0)||1 + «/2\/log(p/5)n||0i - 0||i < + ^Slog(p/S). 
Conclusion. From the two above cases, we know that with probability $1 - \delta$,

3k 2 6k, 


This implies that 


and 


ll*(0i - 9)\\l + n/2 x /\og(p/6)n\\9 1 -0||i < + -^-(A+lj^Slogip/S). 


I Wi - 0)111 < + -j(A + 1)) S log (p/6), 


ii ft - s ii^(S +Hi r^) s / l0S<P/{) 


(B.9) 


(B.10) 


Let a = 6\ — 6, and < 2 q be the order version of \a\. Note that the last equation implies that |a(j)| < + 

^T 1 ) and thus 


Y a u) - Y & 


j=s +1 J=s+1' 


^6k + 12(A + l) ^g, /log(p/5) 


6k i 12(^4 + l)^ 2 c log(p/(5) 
o . 


<-<s+^r 


(B.l 1) 


Assume that 9 is such that 


A /Q k 2 v 

l(0i)t/-(0Vll2< (—+6k(A + 1)) 


S log (p/8) 


where U is the support of the S largest entries of the vector \6\ — 6 1, i.e. 

S . O „2 


2 ^ ( 3k 2 , \S\og(p/5) 


3 = 1 


Combining this with Equation (|B.l 1 [) implies 


Y a h ^ 

j=i 


(it + 6k{a + 1} ) +12 (^ + +1} ) 


6k 12, 


Slog(p/S) 


which concludes the proof since Y^=i a %) = ||^i — 0||| 
Assume now the contrary, i.e. that 6 is such that 


„ /S K 2 \ 

|(0t)t/ - {0)u\\l > (— + 6k(^4 + 1)) 


S log (p/8) 


From Equation (B.9), this implies that the compatibility condition is not satisfied for $\hat\theta - \theta_1$, and we thus know that for $U$,
since $U$ is a support of size $S$,

||(0i)t/ - Ouc ||| > 4||(6>i)(7 - Qjj ||2, 


which is equivalent to 


From Equation <| H. 1 1 [), we have 


v v 

Y a h - Y 

3=1 j=S +1 


«(i)- 


P P 


Y a h- Y - 12 (tt + ~Z( A + 1 )) s 


,log {p/6) 


j =1 J=s+1 


and this completes the proof by summing these two terms, using the fact that $\|\theta_2\|_2 \le C\sqrt{S\log(p/\delta)/n}$, and that $\theta = \theta_1 + \theta_2$.
C Proof of Theorem A.1

C.1 A related testing problem

Set 


Pn > 


y / min(c m , 1) 

x min(v / CH(T7 


/2(C'm + 1) jSolog{p/5) 
min(c m , 1) 


We cast the testing problem

$$H_0 : \theta \in \mathcal S_{S_0}(C, B, p, \delta) \quad \text{vs.} \quad H_1 : \theta \in \mathcal S_{S_1}(C, B, p, \delta, \rho_n). \quad (C.1)$$


Let $\theta \in \mathcal S_S(C, B, p, \delta)$ for some $S > 0$. By definition of $\hat\theta$ and Theorem 4.1 we know that with probability larger than
$1 - \delta$

$$\|\hat\theta - \theta\|_2^2 \le E\,\frac{S\log(p/\delta)}{n}. \quad (C.2)$$


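Consider now the vector $(\hat r_i)_{i \in \mathcal D_2}$, where $\mathcal D_2$ is the second half of the data set. By definition, for $i \in \mathcal D_2$,

$$(\hat r_i)^2 - \sigma^2 = \Big(\sum_{j=1}^{p} X_{i,j}(\theta_j - \hat\theta_j) + e_i\Big)^2 - \sigma^2$$

$$= \Big(\sum_{j=1}^{p} X_{i,j}(\theta_j - \hat\theta_j)\Big)^2 + 2\Big(\sum_{j=1}^{p} X_{i,j}(\theta_j - \hat\theta_j)\Big)e_i + (e_i^2 - \sigma^2). \quad (C.3)$$

Note first that since the $e_i$ are independent sub-Gaussian random variables, the elements of $(\hat r_i)_{i \in \{1, \ldots, n\}}$ are independent
of each other.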

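The mean of the first elements in the decomposition in Equation (C.3) of $\hat r_i$, namely
$\frac{1}{n}\|X(\theta - \hat\theta)\|_2^2 = \frac{1}{n}\sum_{i=1}^{n}\big(\sum_{j=1}^{p} X_{i,j}(\theta_j - \hat\theta_j)\big)^2$,
is such that (by Assumption 3.2, since $\theta - \hat\theta$ is $p$ sparse, see proof of Theorem 4.1)

$$c_m\|\theta - \hat\theta\|_2^2 \le \frac{1}{n}\|X(\theta - \hat\theta)\|_2^2 \le C_M\|\theta - \hat\theta\|_2^2. \quad (C.4)$$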


The second term in the decomposition, by remarking that ^ X i3 (9 :j — 0^)^ e^ is a centred sub-Gaussian, 

uncorrelated, random process with tail coefficient bounded by -c||X(0 — 0) || 2 < C m c\\9 — 9 H 2 , we obtain by Hoeffding’s 
inequality that with probability at least 1 — 5, 


n p 


- E 2 E X ^j - 9j) e* < C'C M c ||0 - 


1=1 \j =1 


log(l/5) 


(C.5) 


Finally, the last term of the decomposition being also a sub-Gaussian random variable with tail coefficient bounded by c 2 , 
we have by Hoeffding’s inequality, with probability at least 1 — 5, 


1 


E( e ?-- 2 ) 


<ccM l/5) 


(C.6) 
Finally, by Equations (C.3) to (C.6), by definition of $R_n$ and with probability larger than $1 - 4\delta$,

Cm\\9 - 9\\l - C'c{C M \\0 - 9\\ 2 + 1)1 


log(l/<5) 


< Rn < C M \\0 - 9\\i + C'c{Cm\\0 - e\\ 2 + 1 ) 


log(l/5) 


(C.7) 


which implies 

$$c_m\|\theta - \hat\theta\|_2^2 - Cn^{-1/2}\log(1/\delta) \le R_n \le C_M\|\theta - \hat\theta\|_2^2 + Cn^{-1/2}\log(1/\delta),$$
since by definition $C \ge 16C'(C_M^2 B^2 + c^2)$ and $\|\hat\theta - \theta\|_2^2 \le 4B^2$ as both vectors are in $\ell_2(B)$.

Under the null hypothesis $H_0$. Let $\theta \in \mathcal S_{S_0}(C, B, p, \delta)$. Let us re-order the coefficients of $\hat\theta$ as $\hat\theta_{(1)}^2 \ge \hat\theta_{(2)}^2 \ge \ldots \ge \hat\theta_{(p)}^2$.

Equation (C.2) implies that with probability larger than $1 - \delta$,

$$\sum_{j=S_0+1}^{p} \hat\theta_{(j)}^2 \le E\,\frac{S_0\log(p/\delta)}{n} < (t_n)^2.$$

Also Equations (C.2) and (C.7) imply that with probability at least $1 - \delta$,

$$R_n \le Cn^{-1/2}\log(1/\delta) + C_M E\,\frac{S_0\log(p/\delta)}{n} < \tau_n^2,$$

by definition of $\tau_n^2$. This implies that the test is accepted with probability larger than $1 - \delta$.

Under the alternative hypothesis $H_1$. Assume now that $\theta \in \mathcal S_{S_1}(C, B, p, \delta, \rho_n)$. By definition of the set
$\mathcal S_{S_1}(C, B, p, \delta, \rho_n)$, we know that


pI>\\9-s S o \\ 2 2 > 

j=S 0 +l 


(C. 8 ) 


where p„ = 


Pn = y==Ty ( min(3v / C'n- 1 /2log(i/ ( 5) i ^+ 20g g ° lo «< p/ * ) ). 


Case 1: p 2 n = . . 1 

•t / min(c, 

1-5, 


= ( 3 / 


log (p/S) 2 - /^So]og(p/5) 


. Equation (|C.2|> imply that with probability larger than 


y- q 2 > 1 

jJ^+1 U) ~ V min ( C ™ 5 1) 


3£ S i log (p/5) _ E S 1 n\og{p/S) 


= 2 E 


Si log(p/ S) ^ OE ,S 0 log(p/5) ^ t _, 2 


> 2 £- 


> K) 


n n 

since min(c m , 1) < 1. This implies that the test is rejected with probability larger than 1 — 5. 

Case 2: p 2 n = i} (~ x ! 2 log(l/5) + 2yT 

test is rejected. Assume now that 


,S 0 log (p/S) 


• If E?.*,+! <%) > > K) 2 , the,, the 


E <%)S 2E 

i=s 0 +i 


So log(p/5) 


This implies in particular, since X/j=s 0 +i @fk) > \j / AE^ P g ^ p ^ S> + n 1 / 4 that 

■.(zVCn- 1 / 2 + 2\jE 


p-9\\ 2 2 > 


1 


> 


> 


\/min(c m , 1 ) 

1 — |3v / c^=v5+ ll E s«ioe(pm | T 
,1) 2 V n 


S 0 log(p/5)\ , /o 77,^0 log(p/<5) 


- \ 2E- 


i(c 


1 


min(c m , 1 ) 


4^ 
Equation (C.7) and the previous equation imply that with probability at least $1 - 4\delta$,

Rn > Cm ||0 - e\\\ - Cn- 1 / 2 > dCn- 1 ' 2 + _ cn~ 1/2 > r 2 , 

An 

by definition of $\tau_n^2$. This implies that the test is rejected with probability larger than $1 - \delta$.

Conclusion. Finally, we have, by combining the results under $H_1$ and $H_0$, that the test is $2\delta$-uniformly consistent for such
a $\rho_n$, i.e. it satisfies

$$\sup_{\theta \in \mathcal S_{S_0}(C,B,p,\delta)} \mathbb E_\theta \Psi_n + \sup_{\theta \in \mathcal S_{S_1}(C,B,p,\delta,\rho_n)} \mathbb E_\theta[1 - \Psi_n] \le 2\delta, \quad (C.9)$$

by just summing the bounds on the probability of error under $H_0$ and $H_1$.

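C.2 The confidence set is honest and adaptive

The confidence set is adaptive and honest on $\mathcal P_n(\rho_n)$

Let $\theta \in \mathcal P_n(\rho_n)$. Let $S = S_0$ if $\theta \in \mathcal S_{S_0}$, and $S = S_1$ otherwise. Let $\hat S = S_0$ if $\Psi_n = 0$ and $\hat S = S_1$ otherwise. We have

$$\mathbb P_\theta(\theta \in C_n) = \mathbb P_\theta\Big(\|\hat\theta - \theta\|_2^2 \le E\,\frac{\hat S\log(p/\delta)}{n}\Big)$$

$$\ge \mathbb P_\theta\Big(\|\hat\theta - \theta\|_2^2 \le E\,\frac{S\log(p/\delta)}{n},\ \hat S = S\Big)$$

$$\ge 1 - \mathbb P_\theta\Big(\|\hat\theta - \theta\|_2^2 > E\,\frac{S\log(p/\delta)}{n}\Big) - \mathbb P_\theta(\hat S \neq S)$$

$$\ge 1 - 3\delta,$$

by definition of $\hat\theta$ as an adaptive estimate, and since $\Psi_n$ is $2\delta$-uniformly consistent on $\mathcal P_n(\rho_n)$.
Moreover

$$\mathbb P_\theta\Big(|C_n|_2^2 \le E\,\frac{S\log(p/\delta)}{n}\Big) = \mathbb P_\theta(\hat S = S) \ge 1 - 2\delta,$$

since $\Psi_n$ is $2\delta$-uniformly consistent on $\mathcal P_n(\rho_n)$.

The confidence set is thus $3\delta$-adaptive and honest on $\mathcal P_n(\rho_n)$. This concludes the proof in cases (i) and (ii) of Figure 1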
since in these cases p n > p n . 

Case (iii) of Figure 1 ($S_0 \ge n^{1/2}/\log(p)$ and $\rho_n = 0$)

Consider now 9 £ Ss 1 (C, B,p, 5) \ V n (p n ). By definition, 

\\0 - ‘S's'o111 < Pn < jCn-V*\og{l/6) + 2^E SqX °^ P,5) < (VC + 2^)^ S ° l °^ p/ ^ , 

since So > n 1 ^ 2 \og(l/5)/ \og(p/5). This implies that 6 £ Ss 0 ((VC+ 2\/E) 2 , B,p, <5), and we have since 9 is an adaptive 
estimate that with probability L - 5 

So in this case 

P e{9 € C n ) = P 9 (||0 - 9\\\ < e^M 5 !) 

n 

> P e (||0 - 9\\ 2 2 < E Salog ^ p/5 \ s = S) 
n 

>1-5. 

So the confidence interval is honest and adaptive for these parameters also, and thus it is honest and adaptive for the entire
set $\mathcal S_{S_1}(C, B, p, \delta)$ (i.e. $\rho_n = 0$). This also concludes the proof in this case.
D Proof of existence of an estimate in the general case

Theorem D.1. Let Assumptions 3.1 and 3.2 be satisfied for some $c, c_m > 0$, $C_M, p > 0$. Let $B > 0$ and $C > 0$. Let
$\delta > 0$. There exists an estimate $\hat\theta$ of $\theta$ such that, for some constants $E, G > 0$ that depend only on
$C, B, c, c_m, C_M$, we have

$$\forall\, 0 < S < Gp, \quad \sup_{\theta \in \mathcal S_S(C,B,p,\delta)} \mathbb P_\theta\Big(\|\hat\theta - \theta\|_2^2 \ge E\,\frac{S\log(p/\delta)}{n}\Big) \le \delta.$$

Proof. Step 1: Decomposition of the problem

Let $\theta \in \mathcal S_S(C, B, p, \delta)$. Let $\theta = \theta_1 + \theta_2$, where $\theta_1$ contains the $S$ largest components of $\theta$, and $\theta_2$ the rest. We have for
any estimator $u$

$$\|Y - Xu\|_2^2 = \|e + X\theta_2\|_2^2 + 2\langle e + X\theta_2, X(\theta_1 - u)\rangle + \|X(\theta_1 - u)\|_2^2$$

$$= v^2 + 2\langle e, X(\theta_1 - u)\rangle + 2\langle X\theta_2, X(\theta_1 - u)\rangle + \|X(\theta_1 - u)\|_2^2, \quad (D.1)$$

where $v^2 = \|e + X\theta_2\|_2^2$ does not depend on $u$.

Note first that

$$|\langle X\theta_2, X(\theta_1 - u)\rangle| \le \|X\theta_2\|_2\,\|X(\theta_1 - u)\|_2$$

$$\le C_M\sqrt{n}\,\|\theta_2\|_2\,\|X(\theta_1 - u)\|_2$$

$$\le C_M\sqrt{CS\log(p/\delta)}\,\|X(\theta_1 - u)\|_2, \quad (D.2)$$

see Equation (B.2).

Let $k > 0$. In the proof of Lemma 2 in [Nickl and van de Geer(2013)] (third equation on p. 19), we have that for
any $b > 0$ there exists a constant $E > 0$ such that for any $u > 0$

$$\mathbb P\Big(\sup_{u:\, u \in \ell_0(k),\ \|X(u - \theta_1)\|_2 \le b} |\langle e, X(\theta_1 - u)\rangle| \ge Eb\sqrt{2k\log(p)} + b\sqrt{2u}\Big) \le \exp(-u).$$

The proof of this equation in the paper [ [Nickl and van de Geer(2013)| is based on the fact that the e— covering number of 

/ \ k -\-1 

the set {u : u £ lo{k), ||X(u — 6 >i) || 2 = b} is upper bounded by (p(^r^) ) and then on techniques that feature mainly 
Dudley’s entropy bound, and Borell Sudakov Cirelson concentration inequality (or also Talagrand’s inequality). 

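The last inequality implies that for any $\delta > 0$, we have that on some event $\xi_{k,b}(\delta)$ of probability larger than $1 - \delta$ (setting
$u = \log(1/\delta)$)

$$\sup_{u:\, u \in \ell_0(k),\ \|X(u - \theta_1)\|_2 \le b} |\langle e, X(\theta_1 - u)\rangle| \le Eb\sqrt{2k\log(p)} + b\sqrt{2\log(1/\delta)}. \quad (D.3)$$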

Let now $v$ be such that $v \in \ell_0(k)$, $\|X(v - \theta_1)\|_2 > b$. Let $c > 1$ be such that $\|X(v - \theta_1)\|_2 = cb$. Since $v$ is $k$-sparse, and
$\theta_1$ is $S$-sparse, there exists $u$ such that $u \in \ell_0(k + S)$, $u - \theta_1 = \frac{1}{c}(v - \theta_1)$ and $\|X(u - \theta_1)\|_2^2 = \frac{1}{c^2}\|X(v - \theta_1)\|_2^2 = b^2$.
Also by Equation (D.3) (applied in $k + S$), we know that for this $u$, on the event $\xi_{k+S,b}(\delta)$ where Equation (D.3) (applied
in $k + S$) is satisfied

$$|\langle e, X(u - \theta_1)\rangle| \le Eb\sqrt{2(k + S)\log(p)} + b\sqrt{2\log(1/\delta)}.$$

So since $u - \theta_1 = \frac{1}{c}(v - \theta_1)$, we have also that on the event $\xi_{k+S,b}(\delta)$

$$|\langle e, X(v - \theta_1)\rangle| \le Ecb\sqrt{2(k + S)\log(p)} + cb\sqrt{2\log(1/\delta)}.$$

This implies in particular that on the event $\xi_{k+S,b}(\delta)$ of probability larger than $1 - \delta$

$$\sup_{c \ge 1}\ \sup_{u:\, u \in \ell_0(k),\ \|X(u - \theta_1)\|_2 \le cb} \frac{|\langle e, X(\theta_1 - u)\rangle|}{c} \le Eb\sqrt{2(k + S)\log(p)} + b\sqrt{2\log(1/\delta)},$$
and this holds for any $b > 0$ with $E$ that depends on $b$. Setting $b = 1$ and considering the associated $E$, this implies that
with probability larger than $1 - \delta$

$$\sup_{b \ge 1}\ \sup_{u:\, u \in \ell_0(k),\ \|X(u - \theta_1)\|_2 \le b} \frac{|\langle e, X(\theta_1 - u)\rangle|}{b} \le E\sqrt{2(k + S)\log(p)} + \sqrt{2\log(1/\delta)},$$

and by a union bound on all $k \le p$, we get that with probability larger than $1 - \delta$

$$\sup_{b \ge 1,\ k \le p}\ \sup_{u:\, u \in \ell_0(k),\ \|X(u - \theta_1)\|_2 \le b} \frac{|\langle e, X(\theta_1 - u)\rangle|}{b\sqrt{k + S}} \le E\sqrt{2\log(p)} + \sqrt{2\log(p/\delta)/(k + S)} \le D\sqrt{\log(p/\delta)},$$

where $D$ is some universal constant. This implies that with probability larger than $1 - \delta$, for any $u \in \mathbb R^p$, we have

$$|\langle e, X(\theta_1 - u)\rangle| \le D\big(\|X(\theta_1 - u)\|_2 + 1\big)\sqrt{(k + S)\log(p/\delta)}. \quad (D.4)$$


Equations (D.1), (D.2) and (D.4) imply that with probability larger than $1 - \delta$,

$$\|Y - Xu\|_2^2 \ge v^2 + \|X(\theta_1 - u)\|_2^2 - 2D\big(\|X(\theta_1 - u)\|_2 + 1\big)\sqrt{(k + S)\log(p/\delta)} - 2C_M\sqrt{CS\log(p/\delta)}\,\|X(\theta_1 - u)\|_2$$

$$\ge v^2 + \|X(\theta_1 - u)\|_2^2 - A\big(\|X(\theta_1 - u)\|_2 + 1\big)\sqrt{(k + S)\log(p/\delta)}, \quad (D.5)$$

where $A := 2(D + C_M\sqrt{C})$.

Step 2: Definition of the estimator

Define now the estimator $\hat\theta$ as being

$$\hat\theta = \arg\min_u \Big[\|Y - Xu\|_2^2 + \kappa\log(p/\delta)\,\|u\|_0\Big], \quad (D.6)$$

where k > 2A(F + 1 ), where F = yA(-\/2 + 4A + 2\[A + 1) + 1 j. Let k be the sparsity of 9 from now on. 
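The $\ell_0$-penalised criterion (D.6) can be illustrated in the simplest possible case. The sketch below evaluates the objective and, under the assumption of an identity design (made only for the example), gives its exact minimiser: each coordinate is kept when its square exceeds the penalty level and set to zero otherwise. Function names are hypothetical and $\kappa$ is left to the caller.

```python
import math


def l0_penalty_level(kappa, p, delta):
    """Penalty paid per non-zero coordinate: kappa * log(p / delta)."""
    return kappa * math.log(p / delta)


def l0_objective(Y, X, u, lam):
    """Squared residual norm plus lam times the number of non-zero entries of u."""
    residual = [y - sum(x * v for x, v in zip(row, u)) for y, row in zip(Y, X)]
    return sum(r * r for r in residual) + lam * sum(1 for v in u if v != 0.0)


def hard_threshold(z, lam):
    """Coordinate-wise minimiser of (z_j - u_j)^2 + lam * 1{u_j != 0}."""
    return [zj if zj * zj > lam else 0.0 for zj in z]


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0]]  # identity design (assumption)
    Y = [3.0, 0.5]
    lam = 1.0
    u_hat = hard_threshold(Y, lam)
    print(u_hat, l0_objective(Y, X, u_hat, lam))
```

For a general design the minimisation ranges over all supports and is combinatorial, which is why this estimator serves as an existence result rather than as a practical procedure.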

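Since $\hat\theta$ is the minimizer of the above formula, it is in particular such that (with $u = \theta_1$)

$$\|Y - X\hat\theta\|_2^2 + \kappa\log(p/\delta)\,\|\hat\theta\|_0 \le \|Y - X\theta_1\|_2^2 + \kappa\log(p/\delta)\,\|\theta_1\|_0,$$

which implies together with Equations (D.1) and (D.5)

$$\|X(\theta_1 - \hat\theta)\|_2^2 - A\big(\|X(\theta_1 - \hat\theta)\|_2 + 1\big)\sqrt{(k + S)\log(p/\delta)} + \kappa\log(p/\delta)\,k \le \kappa\log(p/\delta)\,\|\theta_1\|_0. \quad (D.7)$$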


Step 3: Proof that with high probability, ||X(0i — 0 )|| 2 < (v/2 + 4/1 + 2y/A)y/(k + S) log (p/5). 
Assume that 0 is such that ||X(0i — 0)||1 > 2A(||A'(0i — u )|| 2 + l)\/(fc + S) log (p/ 6 ). 

Equation ( |D.7| ) implies that 

||X( 0 r - 0 )||l /2 + log(p/ 6 )|| 0 || o < log(p/ 6 )|| 0 1 || o , 
which implies that || 0 || o< Iloilo <S. By the previous equation this implies 

||X( 0 1 - 0 )|| 2 < V 2 -Slog(p/ 6 ). 


This implies that in any case (also when ||AC( 6 *i — 0)||1 < 2A(\\X(9i — u )|| 2 + 1) \J(k + S ) log(p/ 6 )) 

||X(0i - 0 )|| 2 < (y/2 + 4A + 2VA + l)y/(k + S) log {p/5) = Fy/(k + S) log(p/ 6 ). (D. 8 ) 


Step 4: Bound on the sparsity k of 9. 
From Equations (D.7) and (D.8), we deduce that

$$\kappa\log(p/\delta)\,(k - S) \le A(F + 1)(k + S)\log(p/\delta).$$

Since $\kappa \ge 2A(F + 1)$, it implies that

$$k \le (A(F + 1) + \kappa)S := GS.$$

Step 5: Conclusion.

Since $k \le GS \le p$, Assumption 3.2 applies to $\theta_1 - \hat\theta$:

$$\|X(\theta_1 - \hat\theta)\|_2 \ge c_m\sqrt{n}\,\|\theta_1 - \hat\theta\|_2.$$

This implies together with Equation (D.8) that with probability larger than $1 - \delta$

$$\|\theta_1 - \hat\theta\|_2 \le \frac{1}{c_m}\big(\sqrt{2 + 4A} + 2\sqrt{A} + 1\big)\sqrt{\frac{(k + S)\log(p/\delta)}{n}}.$$

This concludes the proof using the fact that $\|\theta_2\|_2 \le C\sqrt{S\log(p/\delta)/n}$ and that $\theta = \theta_1 + \theta_2$.

□








